WWDC2004 Session 615
Transcript
Kind: captions
Language: en
hello everybody, this is the second of our two Xsan sessions, Xsan In Depth. I'm Greg Vaughn, and I'm going to be talking about a lot of the same material that was in the overview session. How many of you were here for the overview session? Okay, good. I'm going to cover some of the same material but from a slightly different perspective: I won't have any product pitches up here, instead a few code samples and hopefully some more technical information. I'm going to start by actually going over some of the same stuff with the different types of file systems, just to separate them out and make it clear, in terms of file systems, what makes a SAN file system different from other types. I'm going to talk specifically about Xsan and the communication protocols and how it's really working. I'll go into a bit of depth on the Xsan Admin and will show a demo of setting up the SAN and some of the other features of the admin. I'll talk about how the volumes work in terms of when you're writing apps; there are some different characteristics, both from local volumes and network volumes, that you need to be aware of. And then finally I'll talk specifically about some developer APIs that can be used in applications to make them work better with Xsan, be more Xsan-aware.
So starting out with the different file system types: Tom mentioned direct-attached storage, which is basically your traditional local hard drive, just some new fancy names for it. The way in which a local hard drive works is you've got your file system, and the drive is presenting just an array of blocks. The job of the file system is to organize that data using the catalog and present a higher-level API up to applications. So in a typical diagram, at the high-level API you're dealing in terms of files: the application is going to open a file, write to a file, use byte offsets into the file. The file system is going to translate that to the blocks on the disk. This is the technology that's been around for decades, but of course a block-level device can't be shared by multiple computers, because there's no way to keep the catalog in sync between the multiple computers. So this early limitation got solved by network-attached storage, or the file server. Network-attached storage basically is just taking the exact same file server and putting it into a box, so the underlying technology is exactly the same. The idea here is you're going to take the high-level call, ship it across the network to the server, and the server will perform the action. So you've got the same diagram: you're making the call at the file system layer, the network file system directly mirrors that call, and on the server, when it makes the call, you're doing the offsets to the disk. This will allow you basically to have the volume integrity, but at the price of funneling all the data over the network to the server, and funneling it all through the single machine. The other slight thing in dealing with file servers is when you scale them up, you tend to have problems with the different types of requests: especially metadata-intensive requests, doing large directory listings and so forth, cause the server to load catalog blocks off disk, which can interfere with the streaming of data off the hard drive, and conversely those metadata requests can be blocked behind the large I/Os. When you're dealing with heavily loaded file servers you see the big latencies in things like opening up a little directory listing, and those sorts of scalability limitations in file servers are really hard to overcome. RAID solves the other part of the problem: you're trying to overcome the limitations of disk speed by combining multiple drives, and in addition to the performance you're getting the reliability of redundancy between the drives.
The important thing about RAID is that it's happening sort of behind the scenes of the file system. The file system isn't aware of it: the RAID system is presenting a single drive to the file system, and the RAID is mapping the block offsets internally to the multiple drives. And you've got the different RAID schemes; I always get them confused, especially RAID 0 and RAID 1, but with RAID 0 you've got the striping, mirroring provides the redundancy, and then with RAID 5 you've got both the performance of writing to multiple drives plus the redundant data, so that if any one drive fails it can be rebuilt easily without losing access to your data. In addition to the underlying RAID you've got software RAID, which happens at the driver level to distribute your data out to the multiple drives, and then once you've done that, the RAID box will further distribute it out to the individual disks. But the point of all this is that the RAID system is all happening down at the block-level device. The file system is still your traditional file system, be it HFS or UFS, that is just dealing with what it sees as a single array of blocks, and it's maintaining the same catalog data as it always would. So it's just doing the same translation down to the disk offsets, and then down at the RAID level that's being distributed out to the multiple disks. The problem with that is you're still faced with the file server: if you want to distribute your data, even though you've sped up the hard drive, all the data is flowing through the one file server, and that is your big limitation.
So how the SAN file system really works to overcome this is by separating out the notion of the catalog from the data. There's no real reason why the catalog needs to live with the data. Once you've decided on one particular place to store your catalog, you can have a special-purpose server that is going to just deal with the catalog and do part of the job of a traditional file system: basically update the catalog and figure out where on the drives the data is going to actually live. That way the client file system can talk to the metadata controller to get the catalog information, but then do the I/O directly to the RAID devices. So here's your typical Xsan setup. The diagram is a little different from Tom's diagrams, but it's got all the same components: you've got your client system (I'm only showing one here), a couple of RAID boxes, and your controller. You've got everything hooked up with Fibre Channel, and you've got the IP network between the client and the controller. In this particular case you've got the normal RAID configuration for the example; because you've got the two controllers, we have the two virtual disks per RAID, a total of four LUNs that we're going to group together to be our Xsan volume. So the first thing you do is you select one of the LUNs that's going to store your catalog data. You can decide to either dedicate that to storing the catalog data, or you can choose to store other data alongside it. The point is, though, you don't want the other data stored with it to be high-performance data, because again you'll start to run into the same limitations you have with a file server, of the data competing with the catalog information. So even though you might want to store other files there, they should be less-accessed files than your more performance-critical ones.
So in this example we've chosen LUN one to store our metadata. Basically, in setting up your volume, all you're really doing is configuring the controller. The controller is the only machine that has a real notion of what comprises the volume, so you tell the controller basically what all the LUNs are that compose the volume, and I'll go into a bit of additional information you can give it in terms of how to construct the volume out of these LUNs. But first, to go over our same old example: you've got the client application making the file-level call. At this stage, rather than shipping the whole call off to a file server as it would in the normal network-attached storage case, it's just going to make a request to find where the data for this file lives, but keep the actual data on the client system. The controller is going to read the catalog information out of its private metadata storage on LUN one and then reply with the disk offset back to the client. At this point, once the client has dedicated storage on a particular LUN, it can talk directly to that LUN and stream the data across without worrying about corrupting block offsets. The other thing is, because you've got a collection of LUNs, you're not only telling the client the offset, you're telling it which LUN the actual data lives on. But other than that, the client has no notion of the actual file system layout on these LUNs: how they're divided into files, how the catalog data is structured, or any of that. All it knows is that this is the area where it wants to write the data the application has given it.
The other thing you can do, as Tom said, is group the LUNs into storage pools, which has an effect very similar to striping in software RAID. The big difference here is that, even though you're getting the same performance effect of being able to talk to two LUNs, rather than being handled by a driver on the client system, it's the metadata controller that's going to control the access to these two LUNs. So in this particular case the client is going to ask for that same file offset, and the controller is going to tell it that part of this file lives on LUN three and part of it lives on LUN four, and the client is just smart enough to know, well, if I'm writing to two different LUNs I can stream the data out simultaneously. So you get the same performance effect that you would with software striping. The other main job of the controller is to handle the file locking. With Xsan, because you've got the controller handling the catalog data, you don't need to worry about file-system-level corruption, but still, within individual files, you need to worry about data corruption. That's the same as any network file server, and the way that is handled is at the application level: you need to make locking calls. It's the exact same locking calls you use for NFS or anything else, but it's the controller that needs to keep track of all those locks and actually arbitrate between the different clients, so it has that role as well. And other than that, as I said, the clients are just writing data out wherever the controller asks them to, and the client applications just see it as one big volume; they don't really know about the LUNs behind it.
The other thing, as mentioned before, is the failover. Basically how this is going to work is, when the clients notice that a controller is down, they actually get together and vote for a backup controller, and you can configure it such that it fails over in, you know, a predictable way. The backup controller comes up, it knows where to read its catalog data from, and reads out the catalog data. It's a journaled file system, so it's able to look at the journal and reconstruct the last few transactions very quickly, come up, and start running. We're still doing some of the performance benchmarks, but as Tom said it should just be a few seconds; I think 15 seconds would be sort of the outer limit. The important thing to note is that during this time (and 15 seconds can be a long time in terms of video streaming) the clients don't always need to talk to the metadata controller. Once they've asked the metadata controller and gotten the offsets for their files, they're dealing directly with the RAID box, so if they're streaming a file off the RAID box and the metadata controller fails, it's possible the new one will be up again before they even notice it's gone, before they have a need to actually talk to the metadata controller again, in which case you'll have uninterrupted I/O. The other thing the clients need to do is, once the controller comes back up, the new controller of course doesn't know about the locks the clients have taken out, so when the clients see that the new controller is up, they need to go and tell it about all the locks they have, and the controller will rebuild the lock table.
So in terms of volume configuration, we talked about the various ways that you can group your LUNs together to build up your whole volume. The first thing you're going to need to do is to pick which LUN your metadata is going to be stored on; that's going to be both the catalog information and the journal. Technically you can configure those to be on different LUNs, but usually there's no reason to do so, so in our admin software they'll generally both be stored on the same LUN. The other thing you're going to do is decide whether you want to store other data files along with the catalog data. Basically the catalog data doesn't take much room, so you can either partition your RAID in such a way that you have a very small LUN and make it exclusive for the catalog data, or if you have a larger LUN you may choose to store other files alongside it. The most important thing about the LUN being used for the metadata storage is that it be safe storage: basically, if you lose your catalog it's very hard to reconstruct your file system, so it's very important that this be mirrored storage. The streaming performance isn't so important, because we aren't talking about huge amounts of data here; the more important thing is the random I/O performance. Catalog data tends to be read in a very random order, so it's very important to have a very responsive drive. The key part of the performance for the whole SAN system, in addition to the streaming performance, is the responsiveness of the metadata controller. I mentioned the questions about the requirement for an IP network: basically what we're trying to do is make that round-trip time to the metadata controller as fast as possible. The other part of that is making sure the storage the metadata controller is talking to is very responsive, and like I said, it's not very big: ten million files will probably only use up about 10 gigs of space for a catalog, so you don't need a lot of storage there, but high-performance storage is very important.
So once you've got your metadata controller, the question for the rest of the storage is how you are going to group it into storage pools. Certainly if you take all your LUNs and combine them into one storage pool, the idea is the client could talk to all the LUNs at once and theoretically get very high performance. There are a few limitations to that. One is you want to make sure all the LUNs have the same characteristics, because it's effectively like software striping: when you've combined all these LUNs together, the smallest LUN and the slowest LUN will be the gating factor for the whole storage pool. So you really want to take all your identical LUNs and group them into storage pools. But there's also another aspect of storage pools, and that is, even though everything's combined into a volume, once the client is done talking to the metadata controller it's only talking to the LUNs in the storage pool it's dealing with at the time, and that won't affect people at all if they're talking to different RAID LUNs on different RAID boxes. So there can be benefits in segregating your data out, so that for instance an ingest station is talking specifically to one LUN, and if nobody else is talking to that LUN, even though they may be talking to storage pools using other LUNs, they won't affect the performance of that one person. That's something that, when you're configuring your Xsan system, only you know: how the data will be used. So there are some compromises involved in setting these up, and especially if you have lots of LUNs there are lots of different ways you can do that. The other side effect of that is that you are going to end up with some storage pools that are faster and other storage pools that are slower.
As was mentioned in the first part, the problem is that applications normally don't know at all where the data goes, which storage pool it gets stored on. If you are setting things up in this very particular way, you need to have more control over that. The default is that every time you create a new file, the controller will just go round-robin around the storage pools and create files on each one in turn. Affinities are the way you force files to be stored on a particular storage pool. The most common way to do that, the one supported in our administration app, is you create folders and basically say items in this folder will always go to this particular storage pool. The only exception to that is if a storage pool fills up: you'll still be able to write into that folder, it'll just then go on to another storage pool and start putting the files there. At the Finder level you'll just see the one volume and how much room is available; in the admin you'll actually be able to look at each storage pool and see how much room is available on each storage pool. In addition to the administration app's mapping of folders, there's a command-line tool that I'll talk about for assigning affinities to particular files, and then finally an application API where the application itself can choose.
So now let me talk a bit about our administration software and how that works, and how you set up a SAN using the administration software. We've tried, certainly as we do in all our products, to consolidate this and make it as easy and understandable as possible, although certainly a SAN is a complicated thing and there are lots of different aspects; there are always trade-offs in terms of giving access to functionality versus making it easy and straightforward to use. So here are just a few screenshots of what a setup looks like. The first step in the setup is you've got to define all the machines that are going to be part of your SAN, both the clients and the controllers. It'll find these machines over Rendezvous; it actually detects at each machine what LUNs it's hooked up to, so it can decide which machines are actually on the same Fibre Channel network. And then it'll come up and allow you to select these machines and say yes, these are the machines that I want to be part of my Xsan system. You then enter the serial numbers for those machines: because you bought a separate Xsan box for each one, you'll have a separate serial number to enter for each of those machines. And then you choose whether you want them to be clients or controllers, and for the controllers you decide their failover priority. You can actually make them all controllers. If a machine is a backup controller just in standby, even if it's normally used as a client editing system, there's really no problem with that unless it actually becomes the controller; just because it's set as a backup controller there won't be any performance degradation, and the license allows it. Once you've installed Xsan on a system, you can either make it a client or a controller; it's your choice. Once you've done that, you need to configure the storage. Basically this is where you decide which volumes you want, what the storage pools in each volume are, and then what LUNs are part of each of those storage pools. And then once you've done that, you basically select the volume and tell the controller to start up on that volume, and as soon as the controller starts up, the volume is available to be mounted on clients. So at this point... oh, actually, a few other things it does: in addition to setting up the volumes, you can set up certain administrator notifications; you can set up email or pager notifications if storage pools fill up, or you have certain failures, or users exceed their quotas. You also can mount and unmount volumes on each of the clients: you can see whether clients currently have the volume mounted, you control when they mount and unmount the volumes, and you can do all of this from the one centralized place. You can set user quotas or group quotas, you can view logs on the various systems, and you can create the folders with affinities, as I said before. So now we'll have a demo of the various admin functionality. All right, good morning, everyone!
You all got your Xsan developer preview CDs, and hopefully you installed it and tried to play with it, but unfortunately, without a SAN it's not terribly interesting. Now, I have a SAN here set up for your enjoyment, and I'm going to make the lights blink and I'm going to make everything happy. So the first thing to do when setting up your Xsan system is to determine which computer is going to be your metadata controller. So let's go ahead and set this guy. We entered the serial number before, because I don't think you want to watch me type in the serial number on all these computers. We set the role to be controller, and if we had multiple controllers we could choose the failover priority. Also, since you want to be on a private metadata network, if you have a dual-NIC machine or multiple Ethernet cards, you can choose which interface accesses the SAN right there, and here's some information about the computer to help you choose which machine you want to be the controller. Next you move on to your LUNs. All you need to do with your LUNs is give them a name; all the other information is defined by your RAID Admin configuration, so you can just rename them there. And here's really where the fun part is: you take your storage and create your volumes. You define your first volume, so we click on Create Volume and we'll name this WWDC volume, and you can change things like the log size and the max number of connections you want to access the SAN. Now, the block allocation size is an important field to pay attention to, and it goes in powers of 2 from 4K to 512K, and that is a performance-tuning parameter that you may need to tweak depending on your typical I/O size. And if you need to know more about any of these values (and you'll see some more coming up), we have help buttons in all of the sheets, and it will bring up contextual help for every single field. So we created that, and we create our first storage pool simply the same way. Let's call that pool one, and let's say we want it to be an exclusive metadata and journaling storage pool, so other data won't interfere with that traffic. Stripe breadth is another important performance-tuning value: if you have multiple LUNs in your storage pool, this is how many bytes it will write to each LUN before moving on to the next. We don't need to change this here because the metadata pool will only have one LUN. So now that we've done that, we bring up our little drawer with the LUNs, and I have a pre-configured LUN; that's just by naming it, that's all you have to do to configure it. Drag, drop, there's your metadata. Then actually we can come back here and call this MD, so you know it's the metadata. Now we create another storage pool for all of our data: come in here, we'll call this video data, and we want to ensure that no journaling and metadata spills over to interfere with our high-definition video, or SD, or whatever we happen to have on here. And we can change our stripe breadth to 128 blocks, and this size here, 512K, is how many bytes it writes, and the size here also depends on the block allocation size you defined in your volume. And you can change the multipath method and permissions and other stuff, and the help will tell you all about that. So let's skip this disk here because it's not the same; drag all that in, there you go. We've configured 6.84 terabytes, actually 7.29 terabytes, in about a minute, and that's all you need to do.
Okay, so we have a SAN set up from before with some files, so we're going to revert to that, and you can see here we have the metadata pool, we have a small audio pool (because we don't need as much bandwidth for audio), and we have our SD video, our high-def, and our post-production. So let's move over here, and here you can see all of our storage pools, and you can see a snapshot of the currently running volume. Each of these will fill up to show you how full each storage pool is, so you know when you need to grow your storage. In the Logs tab you can get all the relevant logs on all of the controllers, all the machines on the SAN, and even filter for certain things. And in Clients you can mount and unmount; you can mount them all at the same time if you really feel so inclined, or unmount them all at the same time. And over in Affinities we could set up all the affinities, and in Quotas you can create quotas and delete quotas, and it's really simple: just go ahead and drag in users, and this is all LDAP integrated, so if you have a directory server you'll see all the records there. You can drag in stuff here, set the quota to 10 megabytes soft quota, well, probably 10 gigabytes soft quota, 20 gigs hard quota, and give them 24 hours, and then go ahead and hit save, and it will send it out. And if you actually had some data in here, the quota status would show how close they are to their soft or hard quota, or if they're even above their soft quota. And that's Xsan.
[Applause]
So basically we've shown you what the admin does. If you're familiar with Mac OS X Server, you'll notice that it looks awfully familiar; that's because we basically leveraged the same technology as we did for the Server Admin. The main difference is, in the Server Admin its main goal was to connect to a server and administer that one machine; even though you could administer multiple machines from the admin, each was considered to be a sort of separate unit in the UI. The Xsan Admin treats the whole SAN as a single entity, so you saw that when you're administering it, you're administering the entire SAN at the same time. Basically the server admin agent is going to run on each of the Xsan machines, both controllers and clients. The Xsan Admin is going to take care of replicating your configuration files around between the machines; it's particularly important between the primary controller and backup controllers that they have the same configuration (if the backup controller thought the volumes were arranged differently, that wouldn't be a good idea), so it'll make sure that the configurations are all the same. It'll be able to monitor the status of the machines, so you can quickly look and see which machines are currently active, which ones have the volume mounted, and it'll contact machines as necessary to perform its functions; sort of behind the scenes it establishes connections. In addition to the admin app, we do provide a set of command-line tools. As I said, we try to keep the admin app streamlined, so in certain cases there's additional functionality available in the command-line tools that we don't actually surface through the admin. All the tools live in one place: inside /Library/Filesystems there's an Xsan folder, and inside there there's a bin folder; that's also where the config files live, among other things. The tools will all be documented; however, if you look at the documentation on the CD, they aren't there. There are actually man pages on the install that you can look at, but we'll try to come up with some better documentation for these. Here are a few examples. cvadmin is sort of the main tool, the one we used a lot when we were still developing the user interface, because it does a lot of the same functionality; it's one of those interactive command-line admin tools, and you can start and stop the controller and do a lot of the various functions. cvaffinity is the one you can use if you want finer control over the affinities: the admin only allows you to set folders, such that everything in that folder will have a particular affinity; if you want to set affinities on a particular file, or see what the affinity on a file currently is, you can use the cvaffinity tool. cvfsck is the normal fsck-style utility for the Xsan volume, so if you bring up Disk Utility you'll be able to click on the volume and do a normal verify or repair, and it will call this tool behind the scenes; but if you want to have scripts or whatever run it, the tool is available. The final one is the defrag tool that was mentioned. snfsdefrag can be used to defragment your file data. It does have one particular extra use that can be helpful sometimes during data-flow work: you might, for instance, ingest a file into one storage pool because it has particular performance criteria, but then later on you might want to access it using different RAID LUNs so that you can ingest new files. snfsdefrag can be used to migrate the storage for a file from one storage pool to another without affecting where it appears in the volume: basically, when you look at the file structure of the volume, the file still appears unchanged, but the actual backing store for that file has moved to a different RAID set.
We mentioned the cross-platform setup with the StorNext file system; I just wanted to quickly go through and show how easy that is. There are two scenarios. Adding StorNext clients to the Xsan system: you can set up your Xsan system normally, you're going to get a license for your StorNext client that will actually get installed on the Xsan controller, and then you're going to just set up your StorNext clients the way you would normally do for a StorNext system; there's basically some information you just need to enter into a couple of config files. The trickier one is when you want to add an Xsan client to a StorNext file system. Our admin software is basically written to administer the entire Xsan environment, and so it doesn't really understand a single Xsan client connecting to some other type of SAN file system, so in this case you're going to need to administer the SAN manually. Luckily it's fairly easy to do: the main thing is you have to add your serial number manually to the config file, and then there are just a couple of other files, mainly the controller addresses, to tell it how to contact the controller. So that's all quite straightforward, and it'll be fully documented in the documentation.
So now I want to talk a bit about how these volumes appear to applications. The first thing that is important is that it is a shared volume; pretty much, in terms of writing and testing applications, it's going to be just like a network file system in that way. The only issue you may run into is that you do find sometimes there are certain apps that, because of performance considerations, aren't used to running on network volumes. I mean, if you've got something that basically ingests high-definition video, there aren't that many file servers that are able to handle that bandwidth, and so it may not be used to running on a shared file system. So it is important to make sure that the applications are doing file locking. This can also be managed sort of at the user level, but it's better if the application itself does the coordination to make sure another copy isn't going to stomp on your data. The file system supports the normal calls that would be done on Mac OS X: you have both the file open flags, the shared lock and exclusive lock, as well as the F_SETLK fcntl for doing byte-range locking; these are commonly referred to as POSIX locks and BSD locks. Also, the open deny modes in Carbon get translated into these the same way as they would for an NFS volume.
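As a rough illustration of those standard calls (nothing Xsan-specific here; this is the same advisory locking you would use on an NFS volume), a minimal sketch in C might look like this:

```c
/* Minimal sketch of the locking calls mentioned above: a BSD whole-file
 * lock via the open flags, plus a POSIX byte-range lock via fcntl.
 * These are standard Mac OS X / BSD calls, not Xsan-specific APIs. */
#include <fcntl.h>
#include <unistd.h>

int write_with_locks(const char *path, const void *buf, size_t len)
{
    /* O_EXLOCK takes a BSD-style exclusive whole-file lock at open time
       (O_SHLOCK would take a shared one). */
    int fd = open(path, O_RDWR | O_CREAT | O_EXLOCK, 0644);
    if (fd < 0)
        return -1;

    /* POSIX byte-range lock: exclusively lock the region we will write. */
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = (off_t)len,
    };
    if (fcntl(fd, F_SETLKW, &fl) == 0) {      /* wait until the range is free */
        write(fd, buf, len);
        fl.l_type = F_UNLCK;                  /* release the byte-range lock  */
        fcntl(fd, F_SETLK, &fl);
    }

    close(fd);                                /* drops the O_EXLOCK lock too  */
    return 0;
}
```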
the other thing to be aware of of course
is that these volumes are very large I
mean x-rayed volumes are already quite
large but the sand volumes are going to
be built up even larger I mean multi
terabyte volumes as you saw it's it's
really easy to build up these big
volumes because there's a tendency to
try and consolidate all your storage
even if you've got a bunch of different
rate boxes into one big volume so in
writing software that's the important
consideration that as well as having
very big file if you have these huge
volume you're going to have you know
possibly many millions of files on this
volume and so if you're writing backup
software and so forth you need to be
aware that things tend to get grouped
into larger volumes and they may have
otherwise the last point is that we did
mention that xn has its own file system
format it unlike raid where it's down at
the disc level and your formatting it
just like you would you know as an HFS
or whatever this is an ex an file system
format not an HSS volume it looks a lot
more like an NFS or aufs volume its case
sensitive and it's single for file
system so you can use carbon but it'll
do Apple double the same way as it would
on an abscess volume and basically in
terms of capabilities it's shared just
like an NFS volume is
It's just that the performance is very different from an NFS volume. So speaking of performance: the main point of all this is to have the fast file I/O, but even when you're contacting the server, if things are set up properly, the server should be much more responsive than it would be if it were a normal file server. You're limiting the traffic to small requests for the metadata information; you're not sending file I/O across. So if you do have an IP network, or you set up your IP network properly, it should be very responsive, and the controller itself should be very responsive. In addition, it's not having to load in large files, so it's able to use its cache more for just caching the catalog information, so hopefully it won't have to go out to disk as often as a normal file server would. So it should be much more responsive than a normal file server, but it'll still be less responsive than a local file system: when it comes to these metadata operations, you're having to make a call across the network, you're calling a computer that is potentially serving lots of clients. The metadata controller does end up being the bottleneck when you're scaling up to large numbers of clients, so it's important, certainly if you're trying to support 64 clients, to have that be a very fast machine. And in terms of the application, you need to be aware that the metadata operations are going to actually be slower than the I/O operations. There are certain ways you can tune your app to deal with that; I'm going to talk a little bit more about preloading extents. But the other thing is to just keep track: catalog operations tend to be the ones that have to go out to the metadata controller, so it's good to minimize those. In terms of the I/O, the file I/O is going to be going directly to the RAID, so in that case it shouldn't be any different than if you had a locally mounted RAID volume.
So now I'm going to talk about a few APIs you can use. Certainly it's expected that you'll start to have server clusters that will be using Xsan, and so server apps may want to take advantage of some of these features; distributed computing apps as well; and then certainly multimedia apps are a very strong focus. So there are three APIs I'm going to highlight (I'll mention a couple of other minor ones): the extent preloading, the affinities, and then the bandwidth reservation that Tom mentioned earlier. The APIs all use a similar mechanism. They're specific to Xsan volumes, and they're going to be accessed through sysctl, but basically we have some sample code that sort of helps you call it, because the actual glue code is a bit gross. The other thing to note is that the API is still in flux, so we provide some sample code on the CD, but if you compiled using that sample code you would need to recompile before the final shipping version comes out. It's there just to try out the APIs and see how they work, but we'll be finalizing the APIs closer to the ship date. And then the last thing is, because these are Xsan-specific APIs, you should use statfs to determine whether this is an Xsan volume you're talking to. Here's some easy code: you basically just call statfs on the file, and the f_fstypename will be unique to Xsan; we actually have a constant in the header that you can compare against.
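A minimal sketch of that check (the real type-name constant comes from the Xsan header mentioned above; the "acfs" string below is only an assumed placeholder):

```c
#include <string.h>
#include <sys/param.h>
#include <sys/mount.h>

/* Returns non-zero if the file at `path` lives on an Xsan volume. */
static int is_xsan_volume(const char *path)
{
    struct statfs sb;

    if (statfs(path, &sb) != 0)
        return 0;                        /* on error, assume it is not Xsan */

    /* Compare f_fstypename against the Xsan file system type name.
       Use the constant from the Xsan header here; "acfs" is a placeholder. */
    return strcmp(sb.f_fstypename, "acfs") == 0;
}
```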
So here's an example of a typical, sort of lovely, sysctl call. You've got your structure that you're going to pass down into the kernel, and it's going to get filled out and then passed back up. This is an easy call in that it's just getting some version information. The other thing about this API is you always need an open file descriptor to make the call; obviously for version information you don't really care, it's not particular to an individual file, so a common thing to do is to just open up the root directory of the file system and make the call on that. This particular call will return the same information no matter what file descriptor it's called against, as long as it's for a file on an Xsan volume.
So the load extents call: the key here is that when you open a file and start reading and writing it, the file system reacts to your calls as you make them. In my example you saw there's the write file system call; the file system needs to go out and ask the metadata controller where the file lives before it can start writing the data out, and because you have that latency in talking to the metadata controller, you may have a hiccup in terms of the reading and writing. The load extents call can tell the system up front that you're going to be reading or writing at these offsets for this particular file, and tell it to go ahead and get all that information up front, so that when you're actually doing the I/O you don't have any of the latencies of talking to the controller.
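A hedged sketch of that pattern (the real call goes through sysctl via the glue in the sample code on the CD; xsan_load_extents() below is a hypothetical wrapper standing in for that glue, not an actual API name):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical wrapper around the sysctl-based "load extents" call. */
int xsan_load_extents(int fd, uint64_t offset, uint64_t length);

void stream_out(const char *path, const char *buf, size_t chunk, uint64_t total)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return;

    /* Fetch all the extent mappings from the controller up front, so the
       writes below never stall on a metadata round trip. */
    xsan_load_extents(fd, 0, total);

    for (uint64_t done = 0; done < total; done += chunk)
        write(fd, buf, chunk);

    close(fd);
}
```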
So the affinities: the thing here is that often you don't want the layout on the file system to necessarily reflect where things are stored in storage pools. A common example of this is you might have a project folder; that project folder could contain audio files and video files, and as far as the user is concerned they want all these files grouped together in a single folder, but as far as the system is concerned, you may want to store the audio files on a different storage pool than the video files. Configuring that all by hand could be a very complicated thing, but applications can take advantage of this because they know what types of files they're saving out and what the characteristics of those files are going to be. So we're going to have a demo of this. All right, so I'm going to demo affinity steering.
I'll open up my demo app (sorry, I don't have a nice icon). So we'll create a file called "my file" and we're going to put it in the post-production storage pool. I'm going to go ahead and start that, and you can see it's writing at reasonable speeds; you'll get an initial burst of speed as it fills the RAID cache, but then it will level out, and if you can see the lights over here you should see one LUN being pegged. Now if we start up a second file, say this is our video file, "video file.mpeg", and we store that in, say, HiDef, and we start that up, we should be pegging two different LUNs on two different storage pools, and it should be going much faster, which it is. Now if you go ahead and look in our volume, it stored them in project files; they're both sitting right next to each other, and that is affinity steering.
So one of the points of that demo was basically that we wanted to show the difference between the storage pools, but we only have one Fibre Channel connection hooked up to the system and didn't really tune it, so don't take those performance numbers as typical performance numbers; we just wanted to show the ways in which an application can talk to the different storage. Basically, how it's going to do that is first it needs to find out what storage pools are available. Early on we called storage pools stripe groups, so that's still reflected in the API: get SG info is going to give you information about the stripe groups, so theoretically an application might be able to do some intelligent figuring out of which stripe group it wants to use. The thing that's probably more common is to do what we did in the demo app, which is basically just present a pop-up to the user to choose which storage pool they want. The API is also going to give you an 8-byte key for that storage pool; that's what you actually use in the set affinity call. So normally you would create the file using open, set the affinity on the file, and then start writing it. The other thing you can do, which we did actually in this app, is call alloc extent space, which will pre-allocate the space for the file, load the extents into the client, and then allow you to start writing it out, and that gives you the highest performance writing.
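A hedged sketch of that workflow (again, the real calls are made through the sysctl glue in the sample code; the xsan_* wrappers and the 8-byte key type below are hypothetical stand-ins, not actual API names):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

typedef uint64_t xsan_sg_key_t;              /* the 8-byte storage pool key */

/* Hypothetical wrappers for "get SG info", "set affinity",
   and "alloc extent space". */
int xsan_get_sg_key(int fd, const char *pool_name, xsan_sg_key_t *key);
int xsan_set_affinity(int fd, xsan_sg_key_t key);
int xsan_alloc_extent_space(int fd, uint64_t bytes);

int create_on_pool(const char *path, const char *pool, uint64_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    xsan_sg_key_t key;
    if (xsan_get_sg_key(fd, pool, &key) == 0) {
        xsan_set_affinity(fd, key);          /* steer this file to the pool   */
        xsan_alloc_extent_space(fd, size);   /* pre-allocate and load extents */
    }

    /* ... stream the data out with write() ... */
    close(fd);
    return 0;
}
```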
So the next thing I want to talk about is bandwidth reservation. Tom described this pretty well; for the people who weren't here, I'll give my own quick version. Basically the idea here is that when you're doing a critical operation, you can't necessarily control what other people are going to be accessing on the SAN. So if you've got your ingest station and it's really critical that you get your high-definition video streamed onto there without any hiccup, you don't want somebody else coming up and just starting to stream some other file on or off that same storage pool and messing up the bandwidth. So basically this is a way for applications to guarantee that they're going to get a particular amount of bandwidth. If somebody else launches something where they don't care about the performance, it'll just get scaled back; if somebody launches an application that's also demanding the critical performance, they'll get an error saying look, this is already in use, this amount of bandwidth has already been reserved, so there isn't enough left for you to do your operation. This is basically used for streaming, especially real-time streaming, and it's per storage pool, because as I said earlier, if somebody's writing to one storage pool it doesn't affect the I/O to another storage pool anyway. So you're reserving bandwidth on a particular storage pool, and people reading or writing other storage pools won't be affected by the reservation. So we'll have a demo of this.
All right, so we're going to use the same application, and say we have a video file and an audio file, and we're writing those to the same storage pool, HiDef. Now say we want 120 megabytes per second, and unfortunately we're sharing at about 80 or 90 megabytes per second over the single Fibre Channel. We go ahead and attempt to reserve bandwidth here, and that one will jump up while the other one goes down; and this is being written to the exact same storage pool, and they're still sitting right next to each other, but one is getting more bandwidth than the other, as much as we had configured it to need. And there you go, that is bandwidth reservation.
So bandwidth reservation is the one feature that only works when applications support it, because it's the application that has to tell the system what file it is that it wants to reserve bandwidth for and how much bandwidth needs to be reserved. The other thing about it is that it requires additional configuration that we don't support in the admin app. It's pretty simple: basically, when you configure the volume, the system isn't able to determine what the throughput to your various storage pools is; it's a very hard thing to determine programmatically, there are a lot of variables involved. So you just need to run a simple test, run an app like the one we just had, find out what the throughput to your storage pool is, and enter that value into the configuration file, and then it'll know basically how much is able to be reserved off of that. Another important thing you can add is to tell it how much you don't want to be able to be reserved. As you saw, once somebody makes the reservation, the rest of the performance is going to drop way down to allow that person to have that bandwidth. It's critical that you don't have everybody else drop to zero and have one person who's reserved the entire bandwidth, because that can lead to deadlocks and other problems. So at a minimum the system leaves 1 megabyte per second free, so that other people can at least do very slow I/O to that storage pool, but under certain circumstances you may actually want to increase that, so there's another field to determine the non-reservable part of the bandwidth.
So the call is basically set real-time I/O. The idea is that you're going to put the storage pool into real-time mode. Normally, once a client has loaded the extents, it's just doing the file I/O; as I said, it usually doesn't even care whether the metadata controller is still around, it's doing its file I/O, it's happy. But once you put the storage pool into real-time mode, it goes out and tells all the clients that are using the storage pool that they now need to make requests for I/O. So each client will then ask the controller, say basically I want to do I/O to this storage pool, and the controller will give it a token allowing it to do a particular amount of I/O for a certain time slice. Depending on how many clients are asking for I/O, it'll parcel out different amounts of I/O to the different clients, and it will balance that dynamically as time goes on. The important thing is that when the person who's reserving the bandwidth makes the call, they specify a file descriptor that's going to be used for the performance-critical operation, and that file descriptor will not be limited: you'll be able to make reads and writes freely to that file descriptor. There's actually another call you can make if you have multiple file descriptors: you can make another call to enable multiple file descriptors to be ungated.
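A hedged sketch of how an application might use this (the real "set real-time I/O" call is sysctl-based; xsan_set_rtio() and xsan_ungate_fd() below are hypothetical wrappers standing in for the sample-code glue, not actual API names):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical wrappers: reserve bytes/sec on the storage pool backing `fd`
   (putting that pool into real-time mode), and exempt an extra descriptor
   from the per-client I/O gating. */
int xsan_set_rtio(int fd, uint64_t bytes_per_sec);
int xsan_ungate_fd(int fd);

int ingest_with_reservation(const char *path, uint64_t bytes_per_sec)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    if (xsan_set_rtio(fd, bytes_per_sec) != 0) {
        /* The bandwidth is already reserved by someone else; bail out. */
        close(fd);
        return -1;
    }

    /* Reads and writes on `fd` are now ungated; other clients on this
       storage pool are throttled by tokens handed out by the controller. */
    /* ... stream the performance-critical data ... */

    close(fd);
    return 0;
}
```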
That's basically my session. You saw how Xsan can allow you to configure your LUNs together in a much more flexible way and get the performance of all of them out to the various clients. The important points I want to make are that it is a shared file system (that's a very important thing that applications need to be aware of), and that there are these APIs available to add additional value to applications if you're writing for an Xsan system. And then I think we will have Q&A. Oh, well, more information: basically these are the documents that are available on the CD, and then I think Tom is going to come back up for Q&A, or Eric.
[Applause]