WWDC2004 Session 642
Transcript
Kind: captions
Language: en
I'm Warner Yuen, and welcome to this afternoon's session on HPC software optimization. Once again, a lot of people are always wondering about Apple and high-performance computing, and in reality what they think about is this; but the reality is that there are a lot of great tools for building high-performance computers with Apple hardware. In this session we'll be talking about some of these things: specifically, what's new at Apple for high-performance computing, some of the things for preparing your Apple hardware for high-performance computing, and a great introduction on how to write parallel code. We don't have time to show you all of the cool new Apple tools, but I wanted to give a plug for the new CHUD performance optimization tools; specifically, tomorrow there's a great session called Got Shark that talks about performance profiling. High-performance compilers are what have really allowed us to play in the high-performance computing world; specifically, new this week was the 64-bit compiling ability with GCC in your Tiger previews. In today's session, what we'll really get into is the performance benefits of using Mac OS X's Accelerate framework for high-performance computing; we'll have a presenter talking about streamlining the OS services for performance; another person will come up to talk about what options are available for building high-performance computers with Apple hardware; and then we'll have our introductions on creating parallel code, with both MPI and also a next-generation parallel computing
developer framework. So who's talking today? We have some really great speakers: Steve Peters from Apple's vector and numerics group will come up to talk about Mac OS X; Josh Durham from Virginia Tech will come up and talk about the use of Mac OS X for high-performance computing; Dean Dauger from Dauger Research will come up to talk about writing parallel MPI code; and finally we have Steve Forde from GridIron Software to talk about a next-generation parallel development framework. So with that, I'm just going to go to our first speaker, Steve Peters.
Steve Peters Thank You Warner I'm on the
air on wire I'm going to tell you about
the mathematical facilities in Mac OS 10
so the base of much of scientific high
performance computing that's available
for Mac OS 10 the agenda today is survey
the api's that we ship with Mac os10
tell you about new Tiger api's a little
bit of comparative performance to let
you know that we're I think leading the
field in sort of this price space anyway
and then reinforce the mantra that first
check out our frameworks when you're
interested in getting performance out of
the machines on math so we start out
We start out with chips that are IEEE 754 compliant, like much of the industry; this is what gives us our substrate in math. We then layer on top of that libm, a C99-compliant library full of the elementary transcendental functions that everyone knows and loves: sine, cosine, square root, and so forth. We present these in single, double, and long double precisions, in the complex and in the real domain. Linear algebra is a place where many vendors add value, and so do we: we take the Basic Linear Algebra Subroutines, as shipped in the ATLAS open source package, do some additional tuning, and ship that as part of our Accelerate framework. So if you're looking for any of the BLAS, look first to the Accelerate framework, where they've been closely matched to the hardware on each of our platforms; those come in all the familiar flavors. And layered on top of that is the gold standard of dense numerical linear algebra solvers, LAPACK, again in all the familiar flavors.
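As an aside that isn't in the session itself: a minimal C sketch of what calling those BLAS through the Accelerate framework looks like; the matrix size and values here are arbitrary.

    #include <Accelerate/Accelerate.h>  /* umbrella header: CBLAS, LAPACK, and more */

    int main(void) {
        enum { N = 4 };
        double A[N*N], B[N*N], C[N*N];
        for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* C = 1.0*A*B + 0.0*C: double-precision matrix multiply (DGEMM) */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);
        return 0;
    }

    /* build: gcc -std=c99 dgemm_demo.c -framework Accelerate */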
When we begin to talk about performance, the G5 is our flagship; it's really gotten us into a really interesting place in the HPC space, and in my opinion, for scientific work, it's the dual floating-point cores that have taken us there. For every 970 CPU you get two floating-point cores capable of doing double and single precision IEEE arithmetic. The CPU can dispatch to both of those floating-point instruction units on every cycle, and each can start a new operation on every cycle. All the basic arithmetic operations are present, as well as hardware square root; that's new to PowerPC as we've seen it, anyway, and good for us.
The G5 also offers a class of instructions called fused multiply-add. These are three-operand instructions: basically, multiply the first two together and add the third, as one machine instruction, saving a rounding and finishing the job in a smaller number of cycles than a back-to-back multiply followed by an add; the fused multiply-add does those together. Why do we make a fuss about this? Some people I've heard have said, oh, that's no big deal, fused multiply-add. Well, it's fundamental to linear algebra: it's the essential piece of the dot product, and that is fundamental to matrix multiplication. A big part of the fast Fourier transform, the butterflies, is essentially fused multiply-adds: multiplies and adds which we can fuse. And if you're doing function evaluations, say by Horner's rule, you'll arrange polynomials in a way that can take advantage of fused multiply-adds.
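Not from the slides, but as a hedged illustration of that last point: Horner's rule evaluates a polynomial as a chain of multiply-adds, so each step maps onto one fused instruction. C99's fma() makes the fusion explicit; PowerPC compilers typically fuse a*b + c on their own anyway.

    #include <math.h>

    /* Evaluate p(x) = c[0] + c[1]*x + ... + c[n]*x^n by Horner's rule. */
    double horner(const double *c, int n, double x) {
        double p = c[n];
        for (int i = n - 1; i >= 0; i--)
            p = fma(p, x, c[i]);   /* p = p*x + c[i]: one rounding per step */
        return p;
    }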
So a fused multiply-add counts for two flops, a multiply and an add, and we're credited with four floating-point ops per cycle per CPU. A dual G5 gives us eight flops across both processors per cycle, and at 2.5 billion cycles per second we peak out at 20 theoretical double-precision floating-point gigaflops on the new dual 2.5 GHz Power Mac G5.
So how do you take advantage of this, coming to the platform? Well, if you've already got compiled binaries for Mac OS X, bring them to the G5, our flagship; they'll immediately see some advantage from the ability of the CPU to schedule to both of those floating-point cores. Recompile and you get even better performance, because now the compiler knows there are two floating-point units out there and can rearrange the order of operations in your code to take advantage of both and make efficient use of the dual floating-point cores. And if you have the opportunity to think about your algorithms, you may be able to cast them in ways that squeeze out a bit more performance. This kind of detail we've paid to our libraries: libm, vForce, which I'll talk about in a moment, the BLAS, LAPACK, and our digital signal processing library, vDSP.
In the area of single precision floating point, we have a very formidable capability on both the G4 and the G5: the AltiVec SIMD processor. It's a four-way parallel single precision engine, again with all the basic arithmetic operations and a vector fused multiply-add. We top out here at 40 gigaflops single precision on the new Power Macs, and there are some codes that can get fairly close to using most of that; convolutions are very, very effective on that box. How do you get to high performance on the AltiVec unit? You've got to work a little bit harder: you're really going to have to think your algorithms through and cast them in terms of parallel operations. We have some advice on the web about how to do that, but first of all it's probably wise to profile, and here's another plug for the CHUD folks: they have wonderful profiling tools that will focus you on that ten percent of the code where you're spending ninety percent of your time; look there first. Auto-vectorization is an option, and an even better option was announced this week: GCC 3.5, available later in the year, will have auto-vectorization features. That's a good way to get into the AltiVec game. And finally, the level of detail that gets really close to perfect performance in single precision we've already paid in vForce, the BLAS, vDSP, and vImage.
How do you use these things? We try to make it straightforward, and try to hide, at least a bit, the nature of the platform from your code: you call the API, and we'll dispatch to the proper code suited to the underlying chip. libm, the math library, is linked in by default; there's nothing special you need to do. If you want the long double facilities and the complex APIs, we have libmx, libm extended; that's a flag on the link line for GCC. And for our value-added library, the Accelerate framework, you simply specify -framework Accelerate; that gets you on the air. And of course we ship these performance libraries on every copy of Mac OS X that goes out the door; you can always expect to find them there.
Well, what have we done that's new in Tiger? We've added a library called vForce. It had been called to our attention that the C99 APIs for the familiar elementary functions were data-starved on our machines: we were seeing bubbles in the floating-point pipes that were going unused, unused cycles, and we hate to see those go by. Also, C99 and IEEE demand very careful attention to the rounding modes and the way exceptions are handled, and that adds quite a bit of overhead for APIs that are only processing one operand at a time. So the idea in vForce was to pass many operands through a single call. For example, if you need 768 values of the sine of x, there's a call named vvsinf that lets you pass them all at once; we amortize the overhead and get back to you in a big, big hurry. In fact, that code runs about 12 times faster than a naive loop calling the traditional sine function. There are some caveats here: you have to expect IEEE default rounding mode, you won't see any exceptions, and vForce works expecting that you'll give it arguments that are within the domain of the function, and so forth.
Using these ideas opened up a number of performance opportunities. In single precision we start from AltiVec, which gives us four-way parallelism to begin with. In double precision we've got two FPUs, so let's make sure we schedule those effectively, do some software pipelining, and fill up those bubbles with independent parallel streams of computation. And we take great care in the choice of algorithms to avoid branching, which is very tricky on pipelined machines. vForce is generally as accurate as the traditional libm elementary functions, but we're not always bitwise identical. We handle nearly all the edge cases according to the C99 specs; plus and minus zero are the occasional exceptions, and there's documentation that tells you where. We make no alignment requirements, but if you really want top performance, align to sixteen-byte boundaries; that lets our SIMD engines collect the data most efficiently. We're tuned for the G5, but we also run very well on the G4, and the G3 too.
Here's what's in the inventory: some simple division-like functions, roots, exponentials, logs and powers, trigonometrics, arc trigonometrics, hyperbolics, and some integer manipulation. How do you code to these things? Well, it couldn't be simpler: on the slide, the blue at the top is C, below it is Fortran, and in orange are the obvious command-line compilations.
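The slide's source isn't captured in the transcript; here's a minimal C sketch of the vvsinf call described earlier, assuming Tiger's vForce header; the 768-element count comes from the example in the talk.

    #include <Accelerate/Accelerate.h>   /* vForce's vvsinf() comes in via Accelerate */

    int main(void) {
        enum { COUNT = 768 };
        float x[COUNT], y[COUNT];
        const int n = COUNT;

        for (int i = 0; i < COUNT; i++)
            x[i] = 0.01f * (float)i;     /* arguments within the function's domain */

        vvsinf(y, x, &n);                /* all 768 sines in a single call */
        return 0;
    }

    /* build: gcc -std=c99 vvsin_demo.c -framework Accelerate */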
What else is new in Tiger? We updated to ATLAS 3.6 and did some additional Mac OS X specific tune-ups, and we get some LAPACK performance gains, since it relies on the BLAS, and from some compiler advances. Here's a little performance chart showing in blue our matrix multiply performance, and in the cluster of orange, green, and burnt orange the matrix decompositions: LU, the symmetric LL-transpose decomposition, and the symmetric UDU-transpose. You'll notice that matrix multiply tops out around 13 gigaflops on a new dual 2.5 GHz G5 Power Mac, and the decompositions at, let's say, 11. Here's what the Xeon gets to, a 3.0 GHz Xeon running MKL 6: matrix multiply topping out well under 10, and the decompositions just about getting to 8. And finally, Opteron; these are quite old numbers, from about last summer, topping out at about five and a half. Let's give them fifty percent, since they've up-clocked by about that much; that would be perfect scaling, and that takes them to 8, maybe 9, I think, on matrix multiply. So the G5 is cruising along near 13, the Opteron at eight, maybe nine, and the Xeon probably closer to 12 in the 3.6 GHz incarnation.
All right, finally: we bring long double back to the Mac platform; it was present in Mac OS 9 and earlier; and complex long double; and the C99 tgmath.h type-generic math, so you can say sine of a complex number and the compiler figures out what you mean, in C. Isn't that nice?
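A quick illustration of the type-generic idea, as a hedged sketch rather than the slide's code:

    #include <tgmath.h>   /* C99 type-generic math: pulls in <math.h> and <complex.h> */
    #include <stdio.h>

    int main(void) {
        long double x = 0.5L;
        double complex z = 1.0 + 2.0 * I;

        long double a = sin(x);      /* resolves to sinl() */
        double complex b = sin(z);   /* resolves to csin() */

        printf("%Lf  %f%+fi\n", a, creal(b), cimag(b));
        return 0;
    }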
I'm going to pass by this, since I am running very close to my time, and let me just jump to the end here, where I show on the left our elementary functions, the number of cycles to get in and out of our library functions; in the middle column, what the competition publishes for their x87 microcoded hardware implementations of the transcendentals; and finally, what those cost when they get wrapped in the GNU/Linux system library calls, where, you know, they have to take the additional care to get the rounding modes and the exceptions right, and it costs them a bit more. I think if you compare most of those, the G5 is clearly ahead. There are several documents out on the web on our developer site that can get you started with this kind of stuff. We've already had the Accelerate framework talk, and we're in this talk; if you haven't seen the CHUD stuff, by all means please go see it. And with that, on to Josh.
Thanks, Steve. So today I'm going to do kind of a brief overview of Virginia Tech's System X, which I'm pretty sure most of you have heard of; it's the 1100-node cluster at Virginia Tech. Then we'll go into some detail about some of the services in OS X that you can turn off, briefly go over what kinds of things we did at Virginia Tech to improve our benchmark scores, and very briefly go into some of the management tools that we use at Virginia Tech. So System X is 1100 dual-processor Xserves, so of course 2200 PowerPC 970s. Each Xserve cluster node has four gigs of memory, which gives us basically 4.4 terabytes of system memory for the whole cluster. Early on when we were deploying this, one of the things people said was, well, they've got to be running Linux, or maybe they're running Darwin; but we really are running OS X, the same one that ships. We did a little bit of modification, which I'm going to go into briefly.
The interconnect that we used at Virginia Tech was InfiniBand, and we went with 24 Mellanox InfiniBand switches. These switches give basically 20 gigabits per second, full duplex, per port, and basically 1.92 terabits per second of overall bandwidth for the switch. We get about 12 microseconds of latency across the entire network, and about eight and a half microseconds of latency across a single switch. We use the fat tree topology, and I did a really rudimentary diagram at the bottom of the slide of what a fat tree topology is; in our case it's half bisection, and half bisection means that at any time, if you have half the cluster trying to talk to the other half, you're guaranteed basically half the bandwidth, which in our case is basically five gigabits per second. In addition to that, we also have a secondary gigabit network, which comprises six Cisco switches with 240 ports each, and we basically use that to do management and some file sharing, to basically get the system up and running and deal with some of the administrative stuff.
Power and cooling can never be emphasized enough when you're looking at clusters, especially of this size. At Virginia Tech we're lucky to have this really wonderful computing facility that has basically three megawatts of electrical power, half of which is basically dedicated to System X, the cluster. We have redundancy with our power: we have a UPS system, and we actually have a diesel generator which is pretty much the size of a diesel locomotive; it just sits back on a pad, and it's gigantic. As I said, about half of that, 1.5 megawatts, is reserved for System X. For cooling, we have basically two million BTUs of cooling capacity using Liebert's extreme density cooling; there are some pictures later I can point out, but basically it's a rack-mounted system that blows cold air from above, using standard refrigerant in overhead chillers.
We looked at different kinds of cooling. With a regular data center, usually you have air conditioners throughout the room that basically bring in air from the top, chill it, and push it out through the floor, and then you basically put tiles in the right places to get that air. We looked at trying to do that, and if we had done that, we'd have had wind velocities of about 60 miles per hour underneath the floor; so if you pulled up one of those tiles, you'd get shot with 60 miles per hour of wind.
So that's basically System X. One of the things I'm going to do is overview some of the services in System X, and we're going to turn some of them off to optimize it slightly for more of an HPC-type application. OS X Server by default comes with about 40 processes; that's just the default install. So why do we want to reduce the services? One, of course, it's going to free up resources like memory and CPU time. It increases security, obviously: if you're not running things like a web server, you don't have to worry about securing it. And it reduces the amount of time for system startup, which helps your recovery time when a node does fail.
The other thing I want to emphasize: I always use this analogy when talking about turning off services. It's kind of like those guys that buy a Honda Civic and basically, you know, rip out the engine, put a turbocharger on it, put big spoilers on it, and soup it up. They're kind of designing it for their own purpose, and the problem with that is, you know, Honda's not going to do any sort of hardware support for you. So the thing to keep in mind is that if you turn off these services, this is not something Apple is going to recommend you do; this is what we did at Virginia Tech.
So basically I'm going to step through the different places in OS X where you turn off services, and the first one is /etc/hostconfig. In orange are basically some of the services that run from /etc/hostconfig. We have things like cupsd, which is a printing service; we have automount, which basically handles mounting removable file systems and network file systems; and we have crashreporterd, which sort of sounds important, but basically it's just for creating crash logs for the GUI applications. We have servermgrd, which is basically, if you've ever used Server Admin or Server Monitor with the Xserve, or any of those tools, what they use. So that's one thing I'd like to point out: if you want to use those tools, and they're great tools, you want to keep this service on. Basically this file has a list of services, and just changing a service from yes to no will disable the service the next time you reboot.
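For illustration, the kind of edit being described; the service name here is an example, and the exact entries in /etc/hostconfig vary by OS X Server release.

    # /etc/hostconfig holds lines like CUPS=-YES-; flipping -YES- to -NO-
    # disables that service on the next reboot.
    sudo perl -pi -e 's/^CUPS=-YES-$/CUPS=-NO-/' /etc/hostconfig
    grep CUPS /etc/hostconfig    # verify: should now read CUPS=-NO-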
The next thing we do, and this is kind of blasphemy in the OS X world, is we're going to turn off the GUI. Unfortunately, you know, we'd have 1100 GUIs running, and there's no real need for that; no one is ever going to see them. The place to do that is /etc/ttys: there's this very complicated line in there, and it needs to be commented out. Commenting out that one line is pretty much going to prevent the login window from running; actually, this just does loginwindow, and the WindowServer is in another place. Basically, in OS X there are all sorts of different places where these services get started, and this one is in a directory called /etc/mach_init.d. The way I disable things there is just a kind of personal preference: as opposed to deleting the file, I just create another directory and move the file into that directory; that way, if I change my mind later, I don't have to go find it or make sure it's the right thing; I can just move it back.
The next thing that we turn off is the ATSServer, which basically provides font services. Since we disabled the GUI, we obviously don't need font services on the system. It's something that gets run out of /etc/mach_init.d, so basically just moving that plist into the disabled directory is going to do that.
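A sketch of that move-aside approach; the "disabled" directory name is my own invention, since the point is just to park the plist outside /etc/mach_init.d where it won't get started.

    sudo mkdir /etc/mach_init_disabled.d
    sudo mv /etc/mach_init.d/ATSServer.plist /etc/mach_init_disabled.d/
    # change your mind later? just move the plist back and reboot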
The next stop in turning off services is basically modifying watchdog. Watchdog is this process that basically monitors your system to make sure that processes are running; if they aren't running, it restarts the processes. Another thing that watchdog does that's really nice is that it enables the system to reboot if it crashes. That's actually a pretty nice HPC thing, because the system will come back up and hopefully rejoin the network, and it kind of reduces your downtime; it's almost like a self-healing kind of thing. So in /etc/watchdog.conf we're going to disable two services that we don't need. The print service monitor: obviously we won't be printing from the cluster nodes. And master: I didn't know what it was when I first saw it, but it sounded really important; it's actually just the main daemon for the mail server, so we turn that off. One more thing that's in there is hwmond, which is the hardware monitor. Basically this thing is polling, I think every five seconds by default, and it's just keeping track of your hardware: all your fans, your temperatures throughout the system. It records that, and it can also send notifications, stuff like that. Every five seconds is a little too much, so we kind of bump that up by adding -d 60, and that's going to make it only run once a minute, so that reduces the CPU overhead of this one service.
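Illustrative only; this sketches the intent of the three watchdog.conf changes just described rather than reproducing the exact field syntax, which varies by release (the file's own comments document it).

    # /etc/watchdog.conf, intent sketch:
    #   - disable (comment out) the print service monitor entry
    #   - disable (comment out) the "master" mail daemon entry
    #   - append -d 60 to the hwmond entry so it polls once a minute
    #     instead of every five seconds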
The next thing I turn off is mDNSResponder, and I have a feeling this is not going to be something we'll be able to turn off for long, as more and more things rely on Rendezvous. We have things like Xgrid: if you plan on using Xgrid, you don't want to turn this off, because Xgrid is going to use Rendezvous to find other compute nodes. And if you ever want to use the distributed compile option in Xcode, it also uses Rendezvous. So if you don't plan on doing anything with Rendezvous, this is something you can turn off; it will kind of reduce the amount of network traffic the node sends out, and a little bit of CPU overhead. This one basically has a script in the /System/Library/StartupItems folder, and I just comment out the line that starts it.
So in OS X there are lots of different places to kind of look for services, and you can see there the grayed-out services that we went through. There are some things on there that people will disagree with, that they either need or that should go. For example, I leave the time server on, because I think that's important for the cluster that I run, and I leave cron turned on, because we actually use cron jobs to do things every so often on the cluster; but some people turn that off and don't have any issues with that.
with that so I'm going to talk about the
Linpack optimization and kind of things
we did at Virginia Tech Lynn tak is
basically the benchmark that's used in
the top 500 list so we were number third
in the world in November and so the way
this is establishing we have to run this
benchmark called hpl so we had some some
this was about a year ago and this was a
fort a lot of the the optimization
splendid accelerate frameworks we had a
person japan in Kazushi go to and he did
basically some assembly level
authorizations on major subroutines
basically easy on the d gem subroutines
i have a website there for more
information if you want to look at his
optimizations and one things that we had
to do though is we had to kind of write
our own memory andhra because the the
blast routines that that he was writing
did a much better job if it was
guaranteed a contiguous physical amount
of memory as opposed to having it kind
of gets ugly in or out or partitioned so
with those optimizations actually had
about ten percent increase over over the
Apple decklid at the time remember this
was using Jaguar so we didn't have cell
rate we're still using backwood so with
the optimizations and some of the
tweaking that we did at Virginia Tech we
actually got 10.28 teraflops per second
which is the third fastest in the world
and without those sketchy authorizations
we probably have gotten around eight
point four teraflops which on that list
probably
put it on forward very quickly I just
Very quickly, I just want to go over some of the system management stuff; I could talk for maybe two hours, or five hours, or twelve hours on this; it's something I do a lot of work with. The tool that I love for system management is called Ganglia, and I know BioTeam uses it inside their package. What's really great about Ganglia is that it runs on each system, gathers some status, and kind of broadcasts that out on the network. By default it has a couple of displays, and I have a few of those displays here: at the top there's the cluster load percentage, so it's really great to see what's going on with your 1100 systems; you kind of get to take a step back and be able to see what's really going on in the cluster. And what I love about this is you can drill down: we have that big cluster overview, but you can drill down and look at a specific node. What I also love is the XML data: you can parse that XML data, so at Virginia Tech we basically have a custom display that shows a physical representation of what our cluster is doing, so we can see if a CPU is doing something weird, or we can look at temperatures and loads and get a physical view of it. It really helps with quickly discovering what's going on with our system.
So the things I talked about, of course: an overview of System X, reducing the number of services, what we did on the Linpack scores, and some of the management features. Of course, if you went to Dr. Varadarajan's presentation yesterday you probably saw some of this, but people keep asking us: so what's going on with System X? You know, we dropped off the list, and it's because we swapped out our Power Macs and we're moving to Xserves. So I can say that people are very hard at work on those systems, and we have 850 in. There are some of the racks that we have, and one of the things that's really interesting is that we're using basically a third of the space that we did with the Power Macs, so we only have one aisle where we can do all the cabling, and it gets kind of crowded. I don't know how many people are in that picture, but that's a small space for a whole bunch of people; that's us doing the wiring in the background. You have to wire the Ethernet, do the power, and run the InfiniBand, so there's quite a bit of cabling going on.
So with that, I'm going to introduce Dean Dauger from Dauger Research.

All right, thank you, Josh. Let's see; it's definitely a pleasure to be here today and to be speaking to you. I very much appreciate the kind people at Apple inviting me to come out and talk about plug-and-play clustering and how you can build your cluster in minutes.
What I'd like to go over first is an outline of my talk: first of all, why parallel computing, why parallel computing is interesting to do; what we did to go about essentially reinventing the cluster, inventing the Mac cluster; an introduction to basic message-passing code; then a description of how you can build your own Mac cluster; and hopefully, if the demo gods are kind to us, I can show you what we can do with a Mac cluster.
So why parallel computing? Really, parallel computing is good for problems that are too large, in one sense or another, to solve on one computer. The simple reason is that a problem is taking too much time, too much CPU time; but in many cases it also requires too much memory. Some problems can easily outgrow the RAM capacity available on a single box: some codes run 15 billion particles and have to keep all that data in RAM, and if you multiply that by however many dozens of bytes per particle, you can see that's quite a bit of memory space. The other thing that's happened in the last decade or so is that the programming API has become standardized on what's known as the Message Passing Interface, also known as MPI. It's a specification that was established in 1994, and by the end of the 1990s it became the dominant software interface available at supercomputing centers
such as the San Diego Supercomputing Center as well as NERSC, and also on many cluster systems. This development enabled the possibility of portable parallel code: code that's portable between the supercomputing centers and the clusters, in both Fortran and C, by using MPI, and that's been a real benefit to scientists and many other users of such systems.
systems so they give you an idea of some
of our experience this is a current
picture of the UCLA physics Appleseed
cluster established in 1998 and as you
can see we use a mixture of the g5 from
g4s connected with this bath switch and
we are running on a mix of OS 10 various
versions of OS 10 as well as OS 9 so
we're able to mix and match notes older
and newer hardware and then we can
combine this cluster with machines that
are on people's desks such as my
colleagues professors or postdocs for
graduate students combine them as who
needs to when they're away to go home
from work or on our vacation or if like
a colleague need something needs time
just before a conference they can go
ahead and just use the machines after
mission and evolve them together and
that's really saved a lot of people's
work so and just a little quick notes
this is a picture just from last week
the Dawson cluster it's going to be 256
X serves dual processors currently a
hundred twenty-eight online literally
physically just assemble last week and
we're able to get this picture check
with gigabit rain 10.3 so will be
definitely having some results of that
later the month so a cluster computing
So, cluster computing with Mac OS X: essentially we went about reinventing the cluster computer, and it really is a very nice approach to cluster computing, much more reliable than many other systems I'm familiar with. It's independent of storage, of any kind of command-line login, and of static data like machine lists or static IP files, and that leads to a great deal of reliability, because you don't have to make sure that every little switch is set just right in order for the cluster to work. It also results in the lowest barrier to entry for people who are using clusters, and it really saves a lot of money. And really, the purpose of this whole approach is to enable users to focus on getting useful work done, so they don't have to be bogged down with the mechanics of the cluster; they can actually get real research work done. That was our motivation in designing and assembling the Mac cluster.
The Mac cluster technology design is really divided into two parts: MacMPI, which serves as the MPI layer, and the Pooch application, which supplies the cluster infrastructure. MacMPI is a source code library, normally compiled within the application, and it forms an MPI wrapper over the TCP/IP stack in the operating system. We have two flavors of it, MacMPI and MacMPI_X: it runs on both OS 9 and OS X using Carbon, and the OS X version uses Unix sockets, which results in better latency. The Pooch application is the utility that dynamically manages the cluster and the parallel applications running on it, and it monitors the health status of the cluster. It queries the nodes using Rendezvous, and has done so since 2002, as well as SLP to provide compatibility with OS 9, and queries them to determine more information about the cluster. It provides four user interfaces to the cluster: a GUI; an AppleScript interface; you can direct it with a command line; and other applications can direct Pooch using Apple events to grab nodes on the cluster. And it supports three different MPIs currently: MacMPI, as well as MPICH, common on Linux, and a commercial MPI named MPI/Pro.
So let me give you an introduction to parallel code using MPI. Basically, it's code that coordinates its work using messages. The model is that there are n tasks, or virtual processors, running simultaneously; you label them from 0 to n-1, and these executables often use this identification data to determine what part of the work they're going to do and how to coordinate the work between them. They pass messages between all these virtual processors, or tasks, to organize the data and organize the work. It's really analogous to a number of employees at a company who make phone calls or hold meetings to coordinate work among themselves and accomplish a much larger project. Any group of tasks can communicate; that implies on the order of n-squared connections, which are supported by the MPI, and it supports simple sends and receives as well as collective calls such as broadcast, where you're sending from one task to all the others; or gather, where you're collecting data, say for data output; or reduction operations, such as computing the maximum of an array that's spread across the cluster, or the sum, or other parameters like that; and also matrix operations such as transpose, and vector operations. Synchronization is not required between the tasks; no precise organization is necessary; it's only implied by the fact that messages need to get from one task to the other.
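Not the talk's own code, but a hedged C sketch of the collective calls just described; the buffer names and sizes are placeholders.

    /* compile with an MPI wrapper such as mpicc */
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int idproc, nproc;
        double params[4] = {0}, localmax = 0.0, globalmax;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &idproc);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* broadcast: task 0 sends params to all the others */
        MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* reduction: the maximum of values spread across the cluster lands on task 0 */
        MPI_Reduce(&localmax, &globalmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }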
To give you an idea of what it looks like, I'll introduce the simplest example of message passing I know of, which we call Parallel Knock. In this diagram the time axis is down, and we have two tasks communicating with each other. At first, task 0 sends a message to task 1, and they both print that message: task 0 prints the message it just sent, and task 1 prints the message it just received. Then a reply is sent back from task 1 to task 0, which is then printed by both tasks: task 0 prints the reply it just received, and task 1 prints the reply it just sent.
just sent so the give an idea of what
the code looks like this is an example
that's fancy but it's also available for
trans you can go and check out our
website to that and the way is this is
buy it up I deep rock is the ID of the
virtual processor for that particular
task and the top part of this statement
is executed by all the odd tasks and the
bottom half the esteem is executed by
aldi by all the even tasks are all the
odds hacen envelope and so what happens
first is that for example for TAS 0
Casarez doesn't performs an NPI stand on
its send message to Heidi proc plus 1
which is one and at the same time and
the lower half day of statement is
executed by task 1 it performs an empty
i receive which receives that message
from ID
truck which is one minus one which is
zero and so that's performed the that
performs the message passing and then
they both print their messages and then
reply is sent that from task one by the
NPI sin and the lower half of this
statement of the reply message back to
Castro correspondingly there's a receive
bite Astro and then they print those
messages and so the result of running
this code looks like this tesoro says
knock knock task one says who's there so
that's an example of a simple
conversation going on between the two
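A minimal C sketch of the Parallel Knock pattern just described; the real tutorial example, in C and Fortran, is on the Dauger Research website.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[]) {
        int idproc;
        char msg[32];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &idproc);

        if (idproc == 0) {                  /* even branch: send, then await the reply */
            strcpy(msg, "Knock knock");
            MPI_Send(msg, sizeof msg, MPI_CHAR, idproc + 1, 0, MPI_COMM_WORLD);
            printf("task %d sent: %s\n", idproc, msg);
            MPI_Recv(msg, sizeof msg, MPI_CHAR, idproc + 1, 0, MPI_COMM_WORLD, &status);
            printf("task %d received: %s\n", idproc, msg);
        } else if (idproc == 1) {           /* odd branch: receive, then reply */
            MPI_Recv(msg, sizeof msg, MPI_CHAR, idproc - 1, 0, MPI_COMM_WORLD, &status);
            printf("task %d received: %s\n", idproc, msg);
            strcpy(msg, "Who's there?");
            MPI_Send(msg, sizeof msg, MPI_CHAR, idproc - 1, 0, MPI_COMM_WORLD);
            printf("task %d sent: %s\n", idproc, msg);
        }

        MPI_Finalize();
        return 0;
    }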
Now, the next example I'd like to go over is Pascal's triangle. This is an example that illustrates local propagation. What I mean by that is that every element eventually interacts with every other element in the problem, but the propagation is local: the interactions are all local, because any one element is simply the sum of the two neighboring elements in the preceding line. So eventually they all interact with each other, but every interaction itself is local. This is similar to a variety of physical problems, such as fluid modeling, where you're looking at, say, fluid flow inside a plasma or inside a blood vessel or things like that, and you can use partial differential equations where neighboring elements interact with each other; or elastic deformation, which occurs when you're using, say, finite element modeling to understand how the Earth's crust is going to deform when a fault slips; or a Gaussian blur, where one point spreads its information to all the neighboring ones using a localized convolution; or molecular dynamics, where you have molecules interacting with each other in a local manner; or certain parts of particle-based plasma models. Those kinds of codes are all good examples of local propagation.
So, parallel Pascal's triangle. The first step is the layout: you can think of the time axis as going down, from 1, to 1 1, to 1 2 1, and so forth. The thing to recognize is where the communication happens in the problem in order to perform the computation, so what I've drawn here are arrows indicating all the places where a certain amount of information, or data, is being propagated from element to element. And the thing to recognize is that when you partition the problem, let's say into three different sections, a certain amount of data has to be propagated through the partitions between each section of the problem. You can handle all the internal communication as normal for any single-processor code, but the MPI calls correspond to the arrows that cross the red boundaries here. By choosing this arrangement for the partitioning, the computation becomes proportional to the volume of the problem, and the communication becomes proportional to the surface area, so you can think of it sort of physically: you'll probably end up with a good communication-to-computation ratio with this kind of organization.
So, splitting it up into the three different sections, imagine the three different tasks running; these are the messages being sent and received, so that for every alternate line you're sending messages to the left or to the right. The computation, even so, is simply that there's an array to work on: to compute the value of an element in one line, you simply sum the previous two. What the message passing does is fill in the gaps as it needs to, to propagate the information between each section. You can see, say, that the left edge of the middle task is a duplicate of the right edge of the left task. That duplication is also known as guard cells: you set up these kinds of guard cells to allow the computation to proceed as if it were the only processor running, but then the MPI simply fills in the guard cells at the moment they're needed. This is actually a fairly prototypical example of a lot of local-propagation-type problems.
To see the code example: again, this is available in Fortran as well. In this case, this if statement alternates between odd and even lines of the problem. For example, in the top part of the if statement we have an MPI_Irecv performed on the right edge of the array, from the right processor; the I means it's an immediate receive, it immediately returns, also known as an asynchronous receive, so you're allowed to continue to execute while the receive is happening. Then an MPI send is performed on the left part of the array, element 0, to the left processor, and then an MPI wait is performed to balance out that Irecv and complete the receive that came before. In this case, since everybody is sending something to the left, that means you're receiving something from the right, and that's what it corresponds to. Likewise, in the lower half of the if statement, we're doing an Irecv from the left, then a send to the right, and then a wait to complete the receive; so there we're all sending to the right instead.
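A hedged C sketch of the guard-cell exchange just described; the talk's example is in Fortran, and the array size n and neighbor handling here are my own assumptions.

    #include <mpi.h>

    /* Exchange guard cells for one line: every task sends its left edge to the
       left neighbor and receives its right guard cell from the right neighbor.
       (Sketch only; edge tasks simply skip the missing neighbor.) */
    void exchange_left(double *row, int n, int idproc, int nproc) {
        MPI_Request req;
        MPI_Status status;
        if (idproc + 1 < nproc)   /* immediate (asynchronous) receive on the right edge */
            MPI_Irecv(&row[n - 1], 1, MPI_DOUBLE, idproc + 1, 0, MPI_COMM_WORLD, &req);
        if (idproc > 0)           /* send element 0 to the left processor */
            MPI_Send(&row[0], 1, MPI_DOUBLE, idproc - 1, 0, MPI_COMM_WORLD);
        if (idproc + 1 < nproc)   /* complete the receive posted above */
            MPI_Wait(&req, &status);
    }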
The result of this code looks like this, if we divide it among the three different tasks. The way this is drawn, all the odd values are drawn with an asterisk and all the even values with a space, and we can see that task 1 has a seed at the top, which then propagates out through the boundaries, across the partitions, into the sections on task 0 and task 2. By examining it this way, we can see that we've actually maintained our guard cells: if you look carefully, the left edge of task 1 is identical to the right edge of task 0, so those guard cells are being maintained by the MPI, and we see that it's successful there. The other thing we see is that this forms a shape also known as the Sierpinski gasket in Pascal's triangle. So we're able to perform this problem using MPI in this way, and that's just one of many possible message-passing patterns that are supported using MPI.
For example, there's the nearest-neighbor example on the left, and of course the arrows are reversible, so you can do left-right, left-right. The upper left is another common message-passing pattern, also known as master-slave; that's something that's relatively simple. In this case it shows a broadcast from one task to all the others, or you can reverse it, of course, into a gather. There's also the all-to-all communications pattern, where every node is communicating with every other one; that's very important for data transposes of matrices, and that's important, say, for performing a very large 1D FFT in parallel: you have to go through data transposes, and consequently the message-passing patterns have a lot of all-to-all communications. Or you can divide the message passing like a tree, where one sends to two others and they in turn send to two others; or use a regular pattern for a more regular problem; or any combination of the above, or multi-dimensional versions of the above. These are all things that are possible with MPI and are important for a variety of interesting problems.
interesting problems these are the
applications that have been run on mac
clusters that's not familiar with and
for example and the fls is a picture of
the electric tokamak device so tokamak
is a plasma too but the plasma device
that attempts to hold a plasma in
confinement in a ring shaped pattern or
tourist shaped pattern and one of the
things about many many kinds of tokamaks
is that the plasma in it that is is is
so is is very hot and very hard to
handle and so typically light leaks out
to the walls around it and so they
wanted to try to confine it better but
in order to be able to be able to see
inside it if you stick a probe in it
it's so hot it could just be / eyes so
we're very hard to be able to try to
really probe in there so that's like
computational simulations very
interesting to do and this is an example
a quicktime movie made from a diary
kinetic simulation of a tokamak plasma
in the cross-section so the electric
potential is seeing how it evolves from
the linear state to a saturated state
whoops and then on the right is the
planetarium rendering that was performed
by a customer over Northern Kentucky
University this was submitted to the
first-ever full dome festival and
actually one of the ward this was
performed on a 50 node Matt cluster
rendering out the the three-dimensional
simulation inside a planetarium on lower
left this example that comes from dr.
Wilson back over red
SD and biology where he's using that he
he and his colleagues would a program
called P mr. BAE's that computes the
posterior probabilities of phylogenetic
trees so say that five times fast the
and what it studies is it looks at the
DNA the similarities in DNA between
various species and tries to determine
the evolutionary path in between them
and this was a code that he consulted
with me on to be able to do
parallelization as well as vectorization
with vectorization where we were able to
get a three-times speed up and of course
the parallelization were able to boost
that even more by the number processors
Thunderbolt on the lower right the
quantum pics simulate some diagrams from
Ponte tick simulation is in this case
this is showing a two dimensional
quantum wave function in a super
harmonic oscillator and showing the
circulation of the electron around the
wave function this is actually work that
was based on my doctoral dissertation
which I did entirely on Matt clusters
and what it involves is an approximation
assignment pass integrals that to be
able to choose sample just to pop all
the possible classical paths and use the
plasma code to be able to push those
paths forward and determine the
evolution of the quantum wave function
So, the Mac cluster recipe: basically this is all the description you need to be able to assemble a Mac cluster. The ingredients: simply take a bunch of Power Macs or Xserves, G4s or G5s; upgrade the memory as you need to; get a Fast Ethernet switch, or faster if you have more money; and get a bunch of Ethernet cables. Then the directions: connect the cables from the Macs to the switch; download Pooch, which you can get from our website, and install Pooch, which only takes seconds per node; and then use the AltiVec Fractal demo to be able to test the cluster. So what I'd like to do is see if I can give you a demonstration; if we could switch to demo two? Yes, thanks.
Let's see; okay, good. So let me give you sort of a prototypical idea of a numerically intensive code that we have here; this is known as the AltiVec Fractal demo. Right now it's not using the vector processor that's in this G5 here, and it uses for the computation something I thought was a little bit more numerically challenging; it also counts up how many floating-point operations it does and times itself to determine how many megaflops it achieves. It gets about 1,100 megaflops in this case. But if I use the vector processor, I can go ahead and use that, and it goes quite a bit faster: about 5.6 gigaflops or so, which is pretty nice. And I can also get this to make use of the dual processors, and that gives me another factor of two. But what if I want to get beyond a factor of two? Well, that's where parallel computing comes in, and that's where Pooch comes in.
This is how long it takes to install Pooch: just double-click on the installer, and there we are; Pooch is an acronym, the Parallel OperatiOn and Control Heuristic application. Now, let's say I want to start up a new job on the cluster: I go ahead and click New Job from the File menu, and it opens up a new job window. This job window has two panes: it holds a list of files on the left, the executables that will be copied to the machines listed on the right, and it executes them as a parallel computing job there. So if I click on Select App, I can go ahead and use the file dialog to navigate through the file manager, but I really don't prefer doing it that way; I prefer using drag and drop. How many other parallel clusters can you think of that you could launch on using drag and drop? There really aren't too many. By default it selects the node I'm on, which is Nob Hill Demo 2, and if I click on Select Nodes, it opens up a new network scan window.
It uses both Rendezvous and SLP simultaneously to determine the names and IP addresses of the other machines on the local area network that's here, and I can see that it used this information, the IP addresses, to contact the Pooches on the other machines, in this case the Xserves that are here, and involve those, and determine whether or not they're busy. Busy means they're running a parallel job; those show up in red letters. It shows how much RAM they have, and it also queries other information, such as the clock speed, the operating system, how much load it has, how much disk space, when was the last time, you know, someone touched the mouse; and it uses this information to form a rating of the cluster. That helps you choose the nodes that are most suitable for running on the cluster, and it can actually give you a recommendation: you can go ahead and choose the Add Best function if you want, or you can go ahead and drag and drop, or double-click on the nodes that are there.
If I click on the Options button of the job window, this opens up the options drawer, and you can, say, place the executable in a particular directory on each one of the machines, or perhaps delay the launch until some later time of day, like after a colleague leaves to go home from work. You can also pretend that you're on a very, very large system by launching as many tasks as you like; by default it launches as many tasks as there are processors, but you can also do benchmarking or stress test your code this way. We support the three different MPIs I described earlier; if you want to get through a firewall, you can use a particular port number; or you can queue the job for later execution. So to launch the job, I go ahead and click on Launch Job, and it copies the executable to the other machines, then passes control to the parallel computing code, which then divides the problem into the various different sections and collects the results back here for display. And we get something like 44 gigaflops in this case. So, thank you.
Let's see; I just want to check something. No, okay. So from there, just to show you that this isn't just for fractals, this is an example of the physics codes that we have. Let me go ahead and, actually, that's fine, I'll just involve the same nodes that are here, and we can go ahead. What's happening here is that this is actually a plasma code; it's running a few-million-particle simulation, and it's being performed on the nine processors that are available here. If I go ahead and run this job, we can see the electrostatic potential, and it shows a plasma instability that grows out of it. In the lower right there is the MacMPI monitor window, which is very useful for diagnosing and debugging parallel codes in MPI. In the top part of the window it shows the messages: white means it's not sending anything, red means it's receiving, green means it's sending, and yellow means it's swapping. A typical thing that happens when you're learning how to write a parallel computing code is that a lock-up happens, and it freezes in the light pattern of the hang. Down below there's a histogram of the messages being sent and received as a function of message size, and that encourages you to send fewer large messages rather than too many smaller messages. It also shows you dials of how much time it's been communicating, and how many megabytes per second are being sent or received between these machines. So this is a utility that myself and many colleagues at many venerable institutions have used to be able to diagnose and debug their codes.
To give you another example of the codes, also a physics code: this is an example of a code that performs a Fresnel diffraction problem, where you have a point source of light producing spherical waves and projecting a diffraction image on a screen. And from there, this one actually has a feature where it's able to automatically launch itself in parallel on the cluster. This is a way that I would hope applications become so easy to use: you can simply use a menu click to have it launch itself onto the cluster and make use of the resources that are there. You can see again the MacMPI monitor window showing the messages being received, mostly very large messages; it was going so fast I had to make the problem size bigger. It's also showing, in the colors, the different parts of the problem that are being assigned to the different processors.
One more feature, one other thing I want to show you, something that was just announced this week: what we call Pooch Pro. It has a new user menu where you can actually assign a certain amount of quota to each user, and it computes how much compute time is being used; and this is the only cluster software that I know of that has rollover minutes, so you can roll over your compute time from week to week, let's say. Also, and this is something you would only see as an administrator, you can actually administrate the users that are there: a check mark means you have administrative capabilities, and the columns show quota, rollover minutes, the ability to migrate, and password changing; you can have different passwords and so on. So I can double-click on a particular one and edit, say, how much time Warner Yuen gets on the CPUs; let's say I'll give him just a really little bit of time, or something like that; and now it's also set to limit changes and passwords. Anyway, these are the kinds of things that are available in Pooch Pro. That will be it for the demo; can we switch back to the slides? Thank you.
So just very quickly, for more information: the reference library basically goes with the Dauger Research website. You can find a whole bunch of information at the Pooch websites: find the cluster recipe, download a trial version, and we have a tutorial on writing parallel codes as well as a zoology of parallel computing that has a description of the breeds of parallel computing types. This will all be linked from the WWDC URLs, as well as the Parallel Knock tutorial, with code examples for both Fortran and C, a parallel adder tutorial in both languages, parallel Pascal's triangle, related publications, and actually another video that's a little bit longer than what we've displayed here of some of the work we've done. So I'd like to introduce Steve Forde, and thank you very much for your attention.
software I'm going to go over a real
brief overview of what we would call a
next-generation parallel computing
framework and we're going to do that
really from a very commercial
perspective so probably a lot of the
same points that you're going to you've
heard before I'll go through a little
bit but uh we'll go from here so one of
One of the key things that the ISVs we work with are looking for is obviously speed, but a lot of the time the resources available to the end users of the products they ship are not the best. So you get into a scenario of: do I need to provide one hundred percent performance, or linearity, for every CPU that I add? Because you might have a company like Pixar that has thousands of machines, but you also might have a small post-production facility sitting in a basement with just one or two machines kicking around — is this actually going to provide some value for them? So the challenge for developers is: how do you build a parallel application that provides this performance in a very easy-to-use, seamless fashion?
ISVs are really interested in the money quotient. This is what I like to call our million-dollar slide. From this perspective, this was a customer we have in the print space, and we actually did a comparison between five G5 Xserves and a 12-CPU Sun Fire. The interesting thing is that this was the result, and this is the cost that's generally associated with machines like that — and you can get an idea of why commercial software vendors are very interested in seeing how they can provide this functionality, from a commercial perspective, to everyone in their user base.
So: methods of grid computing. We've heard a lot about different things, but from the grid perspective there are three basic kinds: there's the middleware perspective, there's the opportunity for message passing, which has been talked about a lot, and there are development tools that try to make this whole black art of parallelism a little bit easier on you, the developer. Scripted distribution obviously has some pros and cons. It's very good, as we see with distributed resource managers — and if you're familiar with things like Xgrid and that kind of stuff — to go out and say, okay, I'm going to use existing resources with existing applications and spread things across them. But there generally needs to be some sort of skill set for the end user to understand how to do those things, so it's very useful in areas of scientific computing and research; but when you go to a shrink-wrapped application you're trying to put onto a CD, it's a little tough for a lot of the user base to grasp.
Message passing, as I said, has been talked about a lot, and it's used quite extensively in the scientific and research areas. But the interesting thing we found as we went through our engagements with several ISVs, from the commercial perspective — there are obviously pros and cons there too — was that there wasn't a lot of confidence in their ability to ship it with a product. It was the learning curve associated with actually putting it into their products, and with their users understanding how the thing works.
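As a rough feel for that learning curve, here is about the smallest complete MPI program that partitions work and combines results — the MPI calls are standard, but the example itself is mine, not one from the session; notice how much scaffolding surrounds the one loop of real work:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many of us?       */

        /* Each process sums its own interleaved slice of 0..999. */
        long local = 0, total = 0;
        for (long i = rank; i < 1000; i += size)
            local += i;

        /* Combine the partial sums on rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %ld\n", total);      /* prints 499500 */

        MPI_Finalize();
        return 0;
    }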
Development tools are where you're probably going to see a lot more emphasis down the road, especially as chip design and so on moves in a few different directions. From our perspective, we wanted to create an application development environment with a very high level of abstraction. Message passing is exactly that — a message-passing interface — but you still have to write the parallel application: it provides you the messages, but everything else is left to you. So you not only have to worry about how you partition your algorithm, but then how you do the messaging, and then how you build all the things such as discovery and resource fault tolerance. You've got some good tools — again, like Dean's tool — that can come along and work with MPI and so on on top, but then there's the question of what happens from within, and that's something that's very important.
as a real brief
overview it's a peer-to-peer base
distribute computing architecture and
the api's are built into source it's
more wrapping the source I'm going to
give a quick example of that what we did
for some MPEG encoding and then the work
is dynamically addressed across the
network and that can be to a dedicated
cluster or to specific resources of
death loss it doesn't really matter the
key thing is is that you can go into
that scenario very quickly and from
within your application once it's been
programmatically added provide an end
user with a very engaging experience so
So the development tools have a lot of the same pros and cons, you know: it obviously requires code modification, and we as developers don't like to modify code — it's a very non-trivial thing, especially when you get into busting up algorithms; anybody who works in multi-threading can attest to that. But from our perspective we're kind of like a hybrid between OpenMP and MPI: what we wanted — and I'll show you this in the demo with Adobe After Effects — was the ability to take advantage of just another machine, or another CPU within the same box, for a very serial, thread-based application.
To grid or not to grid — obviously, that's the big question: what kind of development work or effort do you have to put into the parallelization of your code to return some results, and is it worth it? That's always been why parallelism is the black art. So from our perspective we wanted to really start to get rid of that black-art connotation and provide an interesting framework for doing this.
A lot of times — I mean, there's been a lot of talk about processor-intensive applications, but ironically most of the applications we've worked with have more data problems: data movement — you know, reads and writes and those kinds of things — seems to be the major bottleneck in a lot of the applications we've worked with. So we wanted to come up with various different means of, again, saying to you, the developer: you focus on your algorithm, you focus on the thing you know very well, and we'll provide you the parallel application that you can call into.
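To show what attacking that data-movement bottleneck can look like in code, here is a small double-buffering sketch of my own — again illustrative, not anything from GridIron — in which a worker posts the receive for the next block before crunching the current one, so the transfer can overlap the computation:

    #include <mpi.h>
    #include <stdio.h>

    #define N      4096   /* doubles per block     */
    #define BLOCKS 8      /* blocks in the stream  */

    /* Stands in for the real per-block algorithm. */
    static double crunch(const double *buf, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += buf[i] * buf[i];
        return s;
    }

    int main(int argc, char *argv[]) {
        int rank;
        static double a[N], b[N];
        double sum = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                       /* sender: stream blocks out */
            for (int blk = 0; blk < BLOCKS; blk++) {
                for (int i = 0; i < N; i++) a[i] = blk + i * 1e-6;
                MPI_Send(a, N, MPI_DOUBLE, 1, blk, MPI_COMM_WORLD);
            }
        } else if (rank == 1) {                /* worker: double-buffered   */
            MPI_Request req;
            double *cur = a, *next = b, *tmp;
            MPI_Recv(cur, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (int blk = 1; blk < BLOCKS; blk++) {
                /* post the next receive before processing the current block */
                MPI_Irecv(next, N, MPI_DOUBLE, 0, blk, MPI_COMM_WORLD, &req);
                sum += crunch(cur, N);         /* overlaps the transfer */
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                tmp = cur; cur = next; next = tmp;   /* swap buffers */
            }
            sum += crunch(cur, N);             /* last block */
            printf("worker checksum: %g\n", sum);
        }
        MPI_Finalize();
        return 0;
    }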
So, which grid method to use? Obviously, if you're doing embarrassingly parallel, scriptable-type things, distributed resource managers and scripted batch-queue systems are very good; if your source code is not available, that's probably the only route you have. But then you also have the opportunity, depending on the resources at your disposal, to go to message passing or to another type of framework, such as a development environment.
Quickly, on grid-enabling an application: obviously, 90/10 or 80/20 — it's basically the same thing — but when you're looking at the 80/20 rule, focus on that twenty percent of the code that does eighty percent of the work. The abstraction level, again, is very important here, because of what we're really trying to do from a development perspective: there's no such thing as real automatic parallelism, but maybe there are ways to wrap your code and provide hints instead of breaking your algorithms apart — in other words, they can still run the same way they did before, and you don't have to worry about totally wrecking your application.
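One way to picture that wrap-and-hint idea — and this is my own hypothetical sketch, not GridIron's actual API — is a loop whose original serial body is untouched, but which now declares its independence through a tiny wrapper; a framework could farm out the sub-ranges, while the serial fallback runs the code exactly as it always did:

    #include <stdio.h>

    /* Hypothetical hint: the body may run for any sub-range, in any
       order, on any machine. A framework could intercept this call;
       the fallback here simply runs it serially, unchanged. */
    typedef void (*slice_fn)(int lo, int hi, void *ctx);

    static void parallel_for(int lo, int hi, slice_fn body, void *ctx) {
        body(lo, hi, ctx);   /* serial fallback: identical behavior */
    }

    /* The original algorithm, untouched apart from its signature. */
    static void scale_slice(int lo, int hi, void *ctx) {
        float *data = ctx;
        for (int i = lo; i < hi; i++)
            data[i] *= 2.0f;
    }

    int main(void) {
        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        parallel_for(0, 8, scale_slice, data);   /* wrapped, not rewritten */
        for (int i = 0; i < 8; i++)
            printf("%g ", data[i]);
        printf("\n");
        return 0;
    }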
On application modification: we've broken our architecture out into three plugins, covering everything from defining and dispatching the work through to result reassembly. But again, the goal is to provide the end user with a really engaging experience, where they can basically think it was all just done on their PC — or rather, on their Mac — while they use whatever machines are on the network.
MPEG encoding was a challenge we were given by a certain company that does a lot of encoding, and we wanted to see what kind of results we'd get, because this was high-data — this is HD video. We actually went in, did a modification, and ran the test on several Xserves, and we got some very interesting results. We originally tried it out on 12 machines; we did actually go up to 40, but there we started seeing some degradation — the curve of diminishing returns. But we took an HD encode and brought it down from two and a half hours to 26 minutes, and provided a seamless result. The nice thing is that I also didn't need a lot of disk space; I didn't need to have a lot of things — it just went and dynamically moved the data when it was needed and brought the results back to the end user. More importantly, from a development perspective: out of eleven hundred source files we modified only three, changing basically about a thousand lines. Of course, we published a white paper on this whole thing, with a lot of commentary, and that's available on our website if you want to check it out.
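As a back-of-the-envelope check on those numbers — my own arithmetic, not from the talk: two and a half hours down to 26 minutes on 12 machines is a speedup of about 150/26 ≈ 5.8x. Assuming Amdahl's law, S(N) = 1 / ((1 - p) + p/N), an observed S(12) ≈ 5.8 implies a parallel fraction p of roughly 0.90, which caps the speedup at about 1/(1 - p) ≈ 10x no matter how many machines you add, and gives S(40) ≈ 8x — consistent with the diminishing returns they report seeing around 40 machines.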
Another thing we did — and what we're shipping with right now — is with Adobe After Effects, and I wanted to show that to you, so if I can switch over to demo two... I want to show it very quickly, because this is what we feel the end user has to experience, and this is the challenge for us as developers: to bring an engaging experience to your end customers. The funny thing is — and I'm just going to make a reference here — I think in the keynote on Monday there was a reference to the challenge of shrinking chip designs further; well, a lot of folks are now announcing the ability to go to core multi-threading or multiple-core, cell-type chips, so parallelism is going to be absolutely key if our software environments are actually going to take advantage of it.
So one of the things we did in this scenario is that we're actually going to use these four Xserves here. But from an end-user perspective, they don't know anything about them — I mean, this is a product that ships; you can buy it at Fry's for 900 bucks — and from that perspective, the people who buy it don't know anything about DHCP, they don't know anything about DNS; they don't know anything other than that you plug it in and hit Go. And that's literally all they have to do: they hit Go, it'll automatically and dynamically find all the other machines, and it'll pass each one the data that's relevant for it to work on. But the most important thing is to provide the results right back in the application, in a manner they're very familiar with.
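That hit-Go-and-find-the-machines behavior is typically built on something like a UDP broadcast beacon. The sketch below is a generic illustration of the technique, not GridIron's actual protocol, and the port number is made up; peers running a matching listener would answer with their address and capabilities:

    #include <arpa/inet.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define DISCO_PORT 47901   /* hypothetical discovery port */

    /* Shout "who's out there?" to the whole local subnet. */
    int main(void) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int yes = 1;
        setsockopt(s, SOL_SOCKET, SO_BROADCAST, &yes, sizeof yes);

        struct sockaddr_in to;
        memset(&to, 0, sizeof to);
        to.sin_family = AF_INET;
        to.sin_port = htons(DISCO_PORT);
        to.sin_addr.s_addr = htonl(INADDR_BROADCAST);  /* 255.255.255.255 */

        const char *msg = "HELLO: any workers out there?";
        sendto(s, msg, strlen(msg), 0, (struct sockaddr *)&to, sizeof to);

        close(s);   /* replies would be read here in a real client */
        return 0;
    }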
So if you look down here, we're starting to bring in — this is an HD 1080i clip, by the way, for those who are interested — but the interesting thing is that we're bringing the results right into the RAM cache of After Effects, so from the user's perspective this looks just like it always worked, and that's very engaging. The other side, though, is that we have an interesting side effect of using a grid to do all the work, and that is that you can render and work at the same time — that's never been doable before in a single-threaded application like this. I can do things, and render, and so on. So if we can go back to the slides —
That was just a very quick demo, but the power it provides an end user is very engaging, and we've been able to see that. This has been shipping for a month, and I think the stat was that seventeen thousand users are using it — cobbling together machines in their basements, and using it in very large infrastructure as well; the NBA finals were brought to you by this. From that perspective, you know, if we as developers can bring engaging experiences to our customers, our products will come to market in a very engaging light.
So, in summary: obviously, speed is great — but is it worth the work? That's really up to you. You need to look at environments that are going to help you get to a more optimized and parallel infrastructure without the headaches, or the worry of breaking your code. There are new hardware technologies coming down the road — specifically multi-core, cell-type chips — that are going to mean parallelism is absolutely key, so we've got to start thinking about it now. And significant, linear performance is really the thing that customers want to buy.