---
title: WWDC2004 Session 615
framework: wwdc
role: article
path: wwdc/wwdc2004-615
---

# WWDC2004 Session 615

## Transcript

Hello everybody. This is the second of our two Xsan sessions, Xsan In-Depth. I'm Greg Vaughn, and I'm going to be talking about a lot of the same material that was in the overview session. How many of you were here for the overview session? Okay, good. I'm going to cover some of the same material but from a slightly different perspective. I won't have any product prices up here; instead, some little code samples and hopefully some more technical information.

I'm going to start by going over some of the same material on the different types of file systems, just to separate them out and make clear what makes a SAN file system different from other types. I'm going to talk specifically about Xsan, the communication protocols, and how it really works. I'll go into a bit of depth on the Xsan admin; we'll show a demo of setting up the SAN and some of the other features of the admin. I'll talk about how the volumes work: when you're writing apps, there are some characteristics that differ from both local volumes and network volumes that you need to be aware of. And then finally I'll talk specifically about some developer APIs that can be used in applications to make them work better with Xsan and be more Xsan-aware.

So, starting out with the different file system types. Tom mentioned direct-attached storage, which is basically your traditional local hard drive, just with a fancy new name. The way a local hard drive works is that you've got your file system, and the drive presents just an array of blocks. The job of the file system is to organize that data using the catalog and present a higher-level API up to applications. So in a typical diagram, at the high-level API you're dealing in terms of files: the application is going to open a file, write to a file, use a byte offset. The file system is going to translate that to the blocks on the disk. This is the technology that's been around for
decades. But of course a block-level device can't be shared by multiple computers, because there's no way to keep the catalog in sync between them. That early limitation got solved by network-attached storage, or the file server. Network-attached storage is basically just taking the exact same file server and putting it into a box, so the underlying technology is exactly the same. The idea here is that you take the high-level call and ship it across the network to the server, and the server performs the action. So you've got the same diagram: you're making the call at the file system layer, the network file system directly mirrors that call, and on the server, when it makes the call, it does the offsets to the disk. This gives you volume integrity, but at the price of sending all the data over the network to the server and funneling it all through a single machine.

The other thing about file servers is that when you scale them up, you tend to have problems with the different types of requests. Metadata-intensive requests especially, such as large directory listings, cause the server to load catalog blocks, which can interfere with the streaming of data off the hard drive. Conversely, those metadata requests can get blocked behind the large I/Os. When you're dealing with heavily loaded file servers you see big latencies, like when opening up a little directory listing, and those sorts of scalability limitations in file servers are really hard to overcome.

RAID solves the other part of the problem. You're trying to overcome the limitations of disk speed by combining multiple drives, and in addition to the performance you're getting the reliability of redundancy between the drives. The important thing about RAID is that it happens behind the scenes of the file system. The file system isn't aware of it: the RAID system is presenting a single drive to the file
system, and the RAID is mapping the block offsets internally to the multiple drives. You've got the different RAID schemes. I always get them confused, especially RAID 0 and RAID 1, but with RAID 0 you've got the striping, mirroring provides the redundancy, and then with RAID 5 you've got both the performance of writing to multiple drives plus the redundant data, so that if any one drive fails it can be rebuilt easily without losing access to your data. In addition to the underlying RAID you've got software RAID, which happens at the driver level to distribute your data out to multiple drives, and once you've done that the RAID box will further distribute it out to the individual disks.

But the point of all this is that the RAID system is all happening down at the block-level device. The file system is still your traditional file system, be it HFS or UFS, that is just dealing with what it sees as a single array of blocks, and it's maintaining the same catalog data as it always would. So it's doing the same translation down to the disk offset, and then down at the RAID level that's being distributed out to the multiple disks. The problem is that you're still faced with the file server if you want to distribute your data. Even though you've sped up the hard drive, all the data is flowing through the one file server, and that is your big limitation.

The way the SAN file system overcomes this is by separating the notion of the catalog from the data. There's no real reason why the catalog needs to live with the data. Once you've decided on one particular place to store your catalog, you can have a special-purpose server that deals with just the catalog and does part of the job of a traditional file system: it updates the catalog and figures out where on the drive the data is actually going to live. That way the client file system can talk to the metadata controller to get the catalog information, but then do the I/O directly to the
RAID devices.

So here's your typical Xsan setup. The diagram is a little different from Tom's diagrams, but it's got all the same components: your client system (I'm only showing one here), a couple of RAID boxes, and your controller. Everything is hooked up with Fibre Channel, and you've got the IP network between the client and the controller. In this particular case you've got the normal RAID configuration for the example: because you've got the two controllers, we have the two virtual disks per RAID, for a total of four LUNs that we're going to group together to be our Xsan volume.

The first thing you do is select one of the LUNs to store your catalog data. You can either dedicate that LUN to the catalog data or choose to store other data alongside it. The point, though, is that you don't want the other data stored with it to be high-performance data, because you'll start to run into the same limitation you had with a file server, with the data competing with the catalog information. So even though you might want to store other files there, they should be less frequently accessed files than your more performance-critical ones. In this example we've chosen LUN 1 to store our metadata.

In setting up your volume, all you're really doing is configuring the controller. The controller is the only machine that has a real notion of what comprises the volume. So you tell the controller what all the LUNs are that compose the volume, and I'll go into some additional information you can give it about how to construct the volume out of these LUNs.

But first, to go over our same old example: you've got the client application making the file-level call. At this stage, rather than shipping the whole call off to a file server as it would in the normal network-attached storage case, the client just makes a request to find where the data for this file lives, while keeping the actual data on the client system. The controller is
going to read the catalog information out of its private metadata storage on LUN 1 and then reply with the disk offset back to the client. At this point, once the client has dedicated storage on a particular LUN, it can talk directly to that LUN and stream the data across without worrying about corrupting other blocks. Because you've got a collection of LUNs, you're not only telling the client the offset, you're also telling it which LUN the actual data lives on. Other than that, the client has no notion of the actual file system layout on these LUNs: how they're divided into files, how the catalog data is structured, or any of that. All it knows is that this is the area where it should write the data the application has given it.

The other thing you can do, as Tom said, is group the LUNs into storage pools, which has an effect very similar to striping in software RAID. The big difference is that even though you get the same performance effect of being able to talk to two LUNs, rather than being handled by a driver on the client system, it's the metadata controller that controls the access to these two LUNs. So in this particular case the client is going to ask for that same file offset, and the controller is going to tell it that part of this file lives on LUN 3 and part of it lives on LUN 4. The client is just smart enough to know that if it's writing to two different LUNs, it can stream the data out simultaneously. So you get the same performance effect that you would with software striping.

The other main job of the controller is to handle file locking. With Xsan, because the controller is handling the catalog data, you don't need to worry about file-system-level corruption, but within individual files you still need to worry about data corruption. That's the same as with any network file server, and the way it's handled is that at the application level you need to make locking calls. This
works with the exact same locking calls you use for NFS or anything else, but it's the controller that needs to keep track of all those locks and actually arbitrate between the different clients, so it has that role as well. Other than that, as I said, the clients are just writing data out wherever the controller tells them to, and the client applications just see it as one big volume. They don't really know about the LUNs behind it.

The other thing, as mentioned before, is failover. How this works is that when the clients notice that a controller is down, they actually get together and vote for a backup controller, and you can configure it so that it fails over in a predictable way. The backup controller comes up and knows where to read its catalog data from. It reads the catalog data, and because it's a journaled file system it's able to look at the journal, reconstruct the last few transactions very quickly, come up, and start running. We're still doing some of the performance benchmarks, but as Tom said it should take just a few seconds. I think 15 seconds would be the outer limit.

The important thing to note is that 15 seconds can be a long time in terms of video streaming, but the clients don't always need to talk to the metadata controller. Once they've asked the metadata controller and gotten the offsets for their files, they're dealing directly with the RAID box. So if they're streaming a file off the RAID box and the metadata controller fails, it's possible the new one will be up before they even notice it's gone, before they need to talk to the metadata controller again, in which case they're not interrupted at all.

The other thing the clients need to do when a controller comes back up: the new controller of course doesn't know about the locks the clients have taken out, so when the clients see that the new controller is up, they need to go and tell it about all the locks they have, and the controller
will rebuild the lock table.

So, in terms of volume configuration, we talked about the various ways you can group your LUNs together to build up your whole volume. The first thing you need to do is pick which LUN your metadata is going to be stored on. That's both the catalog information and the journal. Technically you can configure those to be on different LUNs, but usually there's no reason to do so, so in our admin software they'll generally both be stored on the same LUN. The other thing you decide is whether you want to store other data files along with the catalog data. The catalog data doesn't take much room, so you can either partition your RAID so that you have a very small LUN and make it exclusive to the catalog data, or, if you have a larger LUN, you may choose to store other files alongside it.

The most important thing about the LUN used for metadata storage is that it be safe storage. If you lose your catalog, it's hard to reconstruct your file system, so it's very important that this be mirrored storage. Streaming performance isn't so important here, because we aren't talking about huge amounts of data. The more important thing is random I/O performance: catalog data tends to be read in a very random order, so it's very important to have a very responsive drive. The key part of the performance of the whole SAN system, in addition to the streaming performance, is the responsiveness of the metadata controller. All the questions about the requirement for an IP network come down to this: what we're trying to do is make the round trip to the metadata controller as fast as possible. The other part of that is making sure the storage the metadata controller is talking to is very responsive. And like I said, it's not very big. Ten million files will probably only use up about 10 gigabytes of space for a catalog, so you don't need a lot of storage there, but
high-performance storage is very important.

Once you've got your metadata controller, the question for the rest of the storage is how you're going to group it into storage pools. Certainly, if you take all your LUNs and combine them into a storage pool, the idea is that the client could talk to all the LUNs at once and theoretically get very high performance. There are a few limitations to that. One is that you want to make sure all the LUNs have the same characteristics. Because it's effectively like software striping, when you've combined these LUNs together the smallest LUN and the slowest LUN will be the gating factor for the whole storage pool. So you really want to take your identical LUNs and group them into storage pools.

But there's another aspect of storage pools: even though everything is combined into a volume, once the client is done talking to the metadata controller, it's only talking to the LUNs in the storage pool it's dealing with at the time. That won't affect other people at all if they're talking to different LUNs on different RAID boxes. So there can be benefits in segregating your data so that, for instance, an ingest station is talking specifically to one LUN, and if nobody else is talking to that LUN, then even though they may be talking to storage pools using other LUNs, they won't affect the performance of that one person. When you're configuring your Xsan system, only you know how the data will be used, so there are some compromises involved in setting these up, and especially if you have lots of LUNs there are lots of different ways you can do it.

The other side effect is that you're going to end up with some storage pools that are faster and others that are slower. As was mentioned in the first part, the problem is that applications normally don't know where the data goes, or which storage
pool it gets stored on, so if you are setting things up in this very particular way you need to have more control over that. The default is that every time you create a new file, the controller just goes round-robin around the storage pools, creating a file on each one in turn. Affinities are the way you force files to be stored on a particular storage pool. The most common way to do that, and the one supported in our administration app, is to create folders and say that items in this folder will always go to this particular storage pool. The only exception is that if a storage pool fills up, you'll still be able to write into that folder; the files will just go onto another storage pool at random. At the Finder level you'll just see the one volume and how much room is available. In the admin you'll be able to look at each storage pool and see how much room is available on each one. In addition to mapping folders in the administration app, there's a command-line tool, which I'll talk about, for assigning affinities to particular files, and finally an application API where the application itself can choose.

So now I'll talk a bit about our administration software, how it works, and how you set up a SAN using it. As we do in all our products, we've tried to consolidate this and make it as easy and understandable as possible, although certainly a SAN is a complicated thing with lots of different aspects. There are always trade-offs between giving access to functionality and making things easy and straightforward to use.

Here are a few screenshots of what setup looks like. The first step in the setup is to define all the machines that are going to be part of your SAN, both the clients and the controllers. It finds these machines over Rendezvous. It actually detects, for each machine, what LUNs it's hooked up to, so it can
decide which machines are actually on the same Fibre Channel network. It'll then come up and allow you to select these machines and say yes, these are the machines I want to be part of my Xsan system. You then enter the serial numbers for those machines. Because you bought a separate Xsan box for each one, you'll have a separate serial number to enter for each machine. Then you choose whether you want them to be clients or controllers, and for the controllers you decide their failover priority. You can actually make them all controllers. If a machine is a backup controller just in standby, even if it's normally used as a client editing system, there's no problem with that. Unless it actually becomes the controller, there won't be any performance degradation just because it's set as a backup controller. And the license allows it: once you've installed Xsan on a system, you can make it either a client or a controller. It's your choice.

Once you've done that, you need to configure the storage. You decide which volumes you want, what the storage pools in each volume are, and what LUNs are part of each of those storage pools. Once you've done that, you select the volume and tell the controller to start up on that volume, and as soon as the controller starts up, the volume is available to be mounted on clients.

A few other things the admin does: in addition to setting up the volumes, you can set up administrator notifications. You can set up email or pager notifications for when storage pools fill up, you have certain failures, or users exceed their quotas. You can also mount and unmount volumes on each of the clients. You can see whether clients currently have the volume mounted and control when they mount and unmount the volumes, all from the one centralized place. You can set user quotas or group quotas, you can view logs on the various systems, and you can
create the folders with affinities, as I said before. So now we'll have a demo of the various admin functionality.

All right, good morning everyone. You all got your Xsan developer preview CDs, and hopefully you installed it and tried to play with it, but unfortunately without a SAN it's not terribly interesting. Now, I have a SAN set up here for your enjoyment, and I'm going to make the lights blink and make everything happy.

The first thing to do when setting up your Xsan system is to determine which computer is going to be your metadata controller, so let's go ahead and set this guy. We entered the serial number before, because I don't think you want to watch me type in the serial number on all these computers. We set the role to be controller, and if we had multiple controllers we could choose the failover priority. Also, since you want to be on a private metadata network, if you have a dual-NIC machine or multiple Ethernet cards you can choose which interface to use to access the SAN, right there. And here is some information about the computer to help you choose which machine you want to be the controller.

Next you move on to your LUNs. All you need to do with your LUNs is give them a name. All the other information is defined by your RAID Admin configuration, so you can just rename them there.

And here's where the fun part is: you take your storage and create your volumes. You need to define your first volume, so we click Create Volume and we'll name this "WWDC volume". You can change things like the log size and the maximum number of connections you want to allow to the SAN. The block allocation size is an important field to pay attention to. It goes in powers of 2 from 4K to 512K, and it's a performance tuning parameter that you may need to tweak depending on your typical I/O size. If you need to know more about any of these values, and you'll see some more coming up, we have help buttons in all of the sheets, and it will
bring up contextual help for every single field. So we created that, and we create our first storage pool the same way. Let's call that "pool one", and let's say we want it to be an exclusive metadata and journaling storage pool, so other data won't interfere with that traffic. Stripe breadth is another important performance tuning value. If you have multiple LUNs in your storage pool, this is how many bytes it will write to each LUN before moving on to the next. We don't need to change it here, because the metadata pool will only have one LUN.

Now that we've done that, we bring up our little drawer with the LUNs, and I have a preconfigured LUN. Just naming it is all you have to do to configure it. Drag, drop, and there's your metadata. Then we can come back here and call this "MD", so you know it's metadata.

Now we create another storage pool for all of our data. We'll call this "video data", and we want to ensure that no journaling and metadata spills over to interfere with our high-definition video, or SD, or whatever we happen to have on here. We can change our stripe breadth to 128 blocks, and this size here, 512K, is how many bytes it writes. The size here also depends on the block allocation size you defined in your volume. You could change the multipath method and permissions and other stuff, and the help will tell you all about that. Let's skip this disk here because it's not the same, and drag all the rest in. There you go: we've configured 6.84 terabytes, actually 7.29 terabytes, in about a minute, and that's all you need to do.

Okay, so we have a SAN setup saved from before, so we're going to revert. You can see here we have the metadata pool, we have a small audio pool because we don't need as much bandwidth for audio, and we have our SD video, our high-def, and our post-production. Let's move over here, and here you can see all of our storage pools and a snapshot of the currently running volume. Each of these will fill
up to show you how full each storage pool is, so you know when you need to grow your storage. In the Logs tab you can get all the relevant logs on all of the controllers and all the machines on the SAN, and even filter for certain things. In the Clients tab you can mount and unmount. You can mount them all at the same time if you really feel so inclined, or unmount them all at the same time. Over in Affinities we can set up all the affinities. In Quotas you can create quotas and delete quotas, and it's really simple: you just go ahead and drag in users. This is all LDAP-integrated, so if you have a directory server you'll see all the records there. You can drag in entries here and set the quota: a 10 megabyte soft quota, or probably a 10 gigabyte soft quota, a 20 gig hard quota, and give them 24 hours. Then go ahead and hit Save and it will send it out. If you actually had some data in here, the quota status would show how close they are to their soft or hard quota, or whether they're above their soft quota. And that's Xsan.

[Applause]

So basically we've shown you what the admin does. If you're familiar with Mac OS X Server, you'll notice that it looks awfully familiar. That's because we leveraged the same technology as we did for Server Admin. The main difference is that Server Admin's main goal was to connect to a server and administer that one machine. Even though you could administer
multiple machines from the admin, each was considered a separate unit in the UI. The Xsan admin treats the whole SAN as a single entity, so as you saw, when you're administering it you're administering the entire SAN at the same time. The Server Admin agent runs on each of the Xsan machines, both controllers and clients. The Xsan admin takes care of replicating your configuration files between the machines. It's particularly important that the primary controller and the backup controllers have the same configuration: if the backup controller thought the volumes were arranged differently, that wouldn't be a good idea. So it makes sure the configurations are all the same. It's able to monitor the status of the machines, so you can quickly see which machines are currently active and which ones have volumes mounted, and it contacts machines as necessary to perform its functions. It establishes those connections behind the scenes.

In addition to the admin app, we provide a set of command-line tools. As I said, we try to keep the admin app streamlined, so in certain cases there's additional functionality available in the command-line tools that we don't surface through the admin. All the tools live in one place: inside Library/Filesystems there's an Xsan folder, and inside there there's a bin folder. That's also where the config files live. The tools will all be documented. However, if you look at the documentation on the current CD, they aren't there. There are man pages in the install that you can look at, but we'll try to come up with some better documentation for these.

Here are a few examples. cvadmin is the main tool, the one we used a lot while we were still developing the user interface, because it does a lot of the same functions. It's one of those interactive command-line admin tools. You can start and stop the
controller and do a lot of the various functions. cvaffinity is the one you can use if you want finer control over the affinities. The admin only allows you to set folders so that everything in that folder has a particular affinity. If you want to set affinities on particular files, or see what the affinity on a file currently is, you can use the cvaffinity tool. cvfsck is the normal fsck-style utility for the Xsan volume. If you bring up Disk Utility, you'll be able to click on the volume and do a normal verify or repair, and it will call this tool behind the scenes, but if you want to run it from scripts or whatever, the tool is available.

The final one is the defrag tool that was mentioned. The defrag tool can be used to defragment your file data. It also has one extra capability that can be useful. Sometimes in a data workflow you might, for instance, ingest a file into one storage pool because it has particular performance criteria, but then later on you might want to access it using different RAID LUNs so that you can ingest new files. snfsdefrag can be used to migrate the storage for a file from one storage pool to another without affecting where it appears in the volume. When you look at the file structure of the volume the file appears unchanged, but the actual backing store for that file has moved to a different RAID set.

We mentioned the cross-platform setup with the StorNext file system. I just wanted to quickly go through and show how easy that is. There are two scenarios. For adding StorNext clients to the Xsan system, you set up your Xsan system normally. You get a license for your StorNext client, which actually gets installed on the Xsan controller, and then you set up your StorNext clients the way you normally would for a StorNext system. There's some information you just need to enter into a
couple of config files. The trickier one is when you want to add an Xsan client to a StorNext file system. Our admin software is written to administer the entire Xsan environment, so it doesn't really understand a single Xsan client connecting to some other type of SAN file system. In this case you need to administer the SAN manually. Luckily that's fairly easy to do. The main thing is that you have to add your serial number manually to the config file, and then there are just a couple of other files, mainly the controller addresses that tell the client how to contact the controller. That's all quite straightforward, and it'll be fully documented in the documentation.

So now I want to talk a bit about how these volumes appear to applications. The first important thing is that it is a shared volume. In terms of writing and testing applications, it's pretty much going to be just like a network file system in that way. The only issue you may run into is that there are certain apps that, because of performance considerations, aren't used to running on network volumes. If you've got something that ingests high-definition video, there aren't that many file servers able to handle that bandwidth, so the app may not be used to running on a shared file system. So it's important to make sure that applications are doing file locking. This can also be managed at the user level, but it's better if the application itself does the coordination to make sure another copy isn't going to stomp on your data.

The file system supports the normal calls that would be used on Mac OS X. You have both the file open flags, for the shared lock and exclusive lock, as well as the F_SETLK fcntl for doing byte-range locking. These are commonly referred to as POSIX locks and BSD locks. The open deny modes in Carbon also get translated into these, the same way as they would for an NFS volume.

The other thing to be
aware of, of course, is that these volumes are very large. Xserve RAID volumes are already quite large, but SAN volumes are going to be even larger. Multi-terabyte volumes, as you saw, are really easy to build, because there's a tendency to consolidate all your storage, even if you've got a bunch of different RAID boxes, into one big volume. So that's an important consideration in writing software: as well as having very big files, with these huge volumes you're possibly going to have many millions of files on a volume. If you're writing backup software and so forth, you need to be aware that things tend to get grouped into larger volumes than they otherwise would.

The last point is that, as we mentioned, Xsan has its own file system format. Unlike RAID, where it's down at the disk level and you format it as HFS or whatever, this is an Xsan file system format, not an HFS volume. It looks a lot more like an NFS or a UFS volume. It's case-sensitive and it's a single-fork file system, so you can use Carbon, but it'll do AppleDouble the same way as it would on an NFS volume. In terms of capabilities it's shared just like an NFS volume is. It's just that the performance is very different from an NFS volume.

So, speaking of performance, the main point of all this is to have fast file I/O. But even when you're contacting the server, if things are set up properly the server should be much more responsive than a normal file server would be. You're limiting the I/O to the small requests for metadata information; you're not sending file I/O across. So if you've set up your IP network properly, it should be very responsive, and the controller itself should be very responsive. In addition, it's not having to load in large files, so it's able to use its cache for just caching the catalog information, so
hopefully it won't have to go out to disk as often as a normal file server would. So it should be much more responsive than a normal file server, but it'll still be less responsive than a local file system when it comes to these metadata operations. You're having to make a call across the network, and you're calling a computer that is potentially serving lots and lots of clients. The metadata controller does end up being the bottleneck when you're scaling up to large numbers of clients, so it's important, certainly if you're trying to support 64 clients, to have that be a very fast machine. In terms of the application, you need to be aware that the metadata operations are actually going to be slower than the I/O operations. There are certain ways you can tune your app to deal with that; I'm going to talk a little bit more about preloading extents. The other thing is just to keep track: requests for catalog information tend to be the ones that have to go out to the metadata controller, so it's good to minimize those. The file I/O, on the other hand, is going directly to the RAID, so in that case it shouldn't be any different than if you had a locally mounted RAID volume.

So now I'm going to talk about a few APIs you can use. It's expected that you'll start to have server clusters that will be using Xsan, so server apps may want to take advantage of some of these features, as may distributed computing apps, and then certainly multimedia apps are a very strong focus. There are three APIs I'm going to highlight (I'll mention a couple of other minor ones): extent preloading, affinities, and the bandwidth reservation that Tom mentioned earlier. The APIs all use a similar mechanism. They're specific to Xsan volumes, and they're going to be accessed through sysctl, but we have some sample code that helps you call it, because the actual glue code is a bit gross. The other thing to note is that the API is still in flux,
so we provide some sample code on the CD, but if you compile using that sample code you will need to recompile when the final shipping version comes out. It's there just to try out the APIs and see how they work; we'll be seeding final APIs closer to the ship date. The last thing is that, because these are Xsan-specific APIs, you should use statfs to determine whether it's an Xsan volume you're talking to. Here's some easy code: you're basically just going to call statfs on the file, and the f_fstypename will be unique to Xsan. We actually have a constant in the header that you can compare against.

So here's an example of a typical, lovely sysctl call. You've got your structure that you're going to pass down into the kernel, and it's going to get filled out and then passed back up. This is an easy call in that it's just getting some version information. The other thing about this API is that you always need an open file descriptor to make the call. Obviously, for version information you don't really care about any particular individual file, so a common thing to do is to just open the root directory of the file system and make the call on that. This particular call will return the same information no matter what file descriptor it's called against, as long as it's for a file on an Xsan volume.

So, the load extents call. The key here is that when you open a file and start reading and writing it, the file system is going to react to your calls as you make them. In my example you saw the write file system call: the file system needs to go out and ask the metadata controller where the file is before it can start writing the data out. Because you have that latency in talking to the metadata controller, you may have a hiccup in the reading and writing. The load extents call can tell the system up front that you're going to be reading and writing at these offsets for this particular file, and tell it to go ahead and get all that
information up front, so that when you're actually doing the I/O you don't have any of the latencies of talking to the controller.

So, the affinities. The thing here is that often you don't want the layout on the file system to necessarily reflect where things are stored in storage pools. A common example of this is that you might have a project folder. That project folder could contain audio files and video files, and as far as the user is concerned, they want all these files grouped together in a single folder. But as far as the system is concerned, you may want to store the audio files on a different storage pool than the video files, in the most efficient way. Because configuring all that by hand could be a very complicated thing, applications can take advantage of this, since they know what types of files they're saving out and what the characteristics of those files are going to be. So we're going to have a demo of this.

All right, so I'm going to demo affinity steering. I'll open up my demo app; sorry, I don't have a nice icon. We'll create a file called my file, and we're going to put it in the post production storage pool. I'm going to go ahead and start that, and you can see it's writing at reasonable speeds. You'll get an initial burst of speed as it fills the RAID cache, but then it will level out, and if you can see the lights over here, you should see one LUN being pegged. Now if we start up a second file, say this is our video file, video file dot MPEG, and we store that in, say, high-def, and start that up, we should be pegging two different LUNs, or two different storage pools and two LUNs, and it should be going much faster, which it is. Now if we go ahead and look in our volume, it stored them in project files. They're both sitting right next to each other, and that is affinity steering.

So one of the points of that demo was that we wanted to show the difference between the storage pools, but we only have one Fibre Channel connection hooked up to the system and didn't really tune it, so
don't take those performance numbers as typical performance numbers. We just wanted to show the ways in which an application can talk to the different storage. How it's going to do that is, first, it needs to find out what storage pools are available. Early on, we called storage pools stripe groups, and that's still reflected in the API: GetSGInfo is going to give you information about the stripe groups. So theoretically an application might be able to do some intelligent figuring out of which stripe group it wants to use. The other thing, which is probably more common, is to do what we did in the demo app, which is just to present a pop-up to the user to choose which storage pool they want. The API is also going to give you an 8-byte key for that storage pool; that's what you actually use in the set affinity call. So normally you would create and open the file using open, set the affinity on the file, and then start writing it. The other thing you can do, which is what we actually did in this app, is call alloc extent space, which will pre-allocate the space for the file, load the extents into the client, and then allow you to start writing. That gives you the highest-performance writing.

So the next thing I want to talk about is bandwidth reservation. Tom described this pretty well; for the people who weren't here, I'll give my own short version. The idea here is that when you're doing a critical operation, you can't necessarily control what other people accessing the SAN are going to be doing. So if you've got your ingest station, and it's really critical that you get your high-definition video streamed on to there without any hiccup, you don't want somebody else coming along and starting to stream some other file on or off that same storage pool and messing up the bandwidth. So this is a way for applications to guarantee that they're going to get a particular amount of bandwidth. If somebody else
launches something where they don't care about the performance, it'll just get scaled back. If somebody launches an application that's also demanding the critical performance, they'll get an error saying, look, this is already in use: this amount of bandwidth has already been reserved, so there isn't enough left for your application. This is basically used for streaming, especially real-time streaming, and it's per storage pool, because, as I said earlier, if somebody's writing to one storage pool it doesn't affect the I/O to another storage pool anyway. So you're reserving bandwidth on a particular storage pool, and people reading or writing other storage pools won't be affected by the reservation. So we'll have a demo of this.

All right, so we're going to use the same application. Say we have a video file and an audio file, and we're going off and writing those to the same storage pool, high-def. Now say we want 120 megabytes per second, and unfortunately we're sharing at about 80 or 90 megabytes per second over the single Fibre Channel. We go ahead and attempt to reserve bandwidth here, and that one will jump up while the other one goes down. These are being written to the exact same storage pool, and they're still sitting right next to each other, but one is getting more bandwidth than the other, as much as we had configured it to need. And there you go: that is bandwidth reservation.

So bandwidth reservation is the one feature that only works with applications that support it, because the application has to tell the system which file it wants to reserve bandwidth for and how much bandwidth needs to be reserved. The other thing about it is that it requires additional configuration that we don't support in the admin API. It's pretty simple. When you configure the volume, the system isn't able to determine what the throughput to your various storage pools is; it's a very hard thing to determine programmatically.
There are a lot of variables involved, so you just need to run a simple test: run an app like the one we just had, find out what the throughput to your storage pool is, and enter that into the field in the configuration file. Then it'll know how much is able to be reserved off of that. Another important thing you can add is to tell it how much you don't want to be reservable. As you saw, once somebody makes the reservation, the rest of the performance is going to drop way down to allow that person to have the bandwidth. It's critical that you don't have everybody else drop to zero and one person reserve the entire bandwidth, because that can lead to deadlocks and other problems. So at a minimum the system leaves 1 megabyte per second free, so that other people can at least do very slow I/O to that storage pool, but under certain circumstances you may want to increase that, so there's another field to set the non-reservable part of the bandwidth.

So the call is basically set real-time I/O. The idea is that you're going to put the storage pool into real-time mode. Normally, once a client has loaded the extents, it's just doing the file I/O; as I said, it usually doesn't even care whether the metadata controller is still around. It's doing its file I/O, it's happy. But once you put the storage pool into real-time mode, it goes out and tells all the clients that are using the storage pool that they now need to make requests for I/O. So each client will then ask the controller, saying basically, I want to do I/O to this storage pool. The controller will give it a token allowing it to do a particular amount of I/O for a certain time slice. Depending on how many clients are asking for I/O, it'll parcel out different amounts of I/O to the different clients, and it will balance that dynamically as time goes on. The important thing is that the person who's reserving the bandwidth is
going to specify, when they make the call, a file descriptor that's going to be used for the performance-critical operation, and that file descriptor will not be limited: you'll be able to make reads and writes freely to it. There's actually another call you can make if you have multiple file descriptors, to enable multiple file descriptors to be ungated.

That's basically my session. You saw how Xsan can allow you to configure your LUNs together in a much more flexible way and get the performance of all of them out to various clients. The important points I want to make are that it is a shared file system, which is a very important thing that applications need to be aware of, and that there are these APIs available to add additional value to applications if you're writing for an Xsan system. And then I think we will have Q&A. Oh, well, more information: these are the documents that are available on the CD. And then I think Tom is going to come back up for Q&A, or Eric. [Applause]
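
The session's advice that applications on a shared volume should coordinate through file locking can be sketched with the standard POSIX byte-range lock. This is a minimal illustration, not code from the session: the helper names `lock_range` and `write_locked` are hypothetical, and only the `fcntl` call with `F_SETLK` and `struct flock` comes from POSIX itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Take or release a POSIX advisory lock on bytes [start, start + len) of fd.
 * type is F_RDLCK, F_WRLCK or F_UNLCK. Returns 0 on success, -1 on failure;
 * errno is EACCES or EAGAIN when another process holds a conflicting lock.
 * (hypothetical helper name, not part of any Xsan API) */
int lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;   /* start is measured from the beginning of the file */
    fl.l_start = start;
    fl.l_len = len;           /* note: a length of 0 means "to end of file" */
    return fcntl(fd, F_SETLK, &fl);   /* non-blocking: fails instead of waiting */
}

/* Write len bytes at offset while holding an exclusive lock on that range,
 * so another copy of the application cannot write the same bytes meanwhile.
 * (hypothetical helper name) */
ssize_t write_locked(int fd, off_t offset, const void *buf, size_t len)
{
    ssize_t written;

    if (lock_range(fd, F_WRLCK, offset, (off_t)len) != 0)
        return -1;            /* the range is locked by someone else */
    written = pwrite(fd, buf, len, offset);
    lock_range(fd, F_UNLCK, offset, (off_t)len);
    return written;
}
```

A caller would open the shared file with `open`, then route every write to a shared region through `write_locked`. Using `F_SETLKW` instead of `F_SETLK` would make the lock request wait rather than fail.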

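
The volume check described in the session (call `statfs` and compare `f_fstypename`) might look roughly like the sketch below. It assumes a Mac OS X or BSD system, where `struct statfs` carries an `f_fstypename` field. The function name `path_is_on_fs_type` is hypothetical, and the type name is passed in by the caller because the actual Xsan constant lives in the Xsan header mentioned in the talk and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__APPLE__) || defined(__FreeBSD__)
#include <sys/param.h>
#include <sys/mount.h>
#define HAVE_FSTYPENAME 1
#endif

/* Returns 1 if path lives on a file system whose type name equals fs_type,
 * 0 if it does not, and -1 if the check could not be made.
 * (hypothetical helper; fs_type should be the constant from the Xsan header) */
int path_is_on_fs_type(const char *path, const char *fs_type)
{
    assert(path != NULL && fs_type != NULL);
#ifdef HAVE_FSTYPENAME
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return -1;                       /* path missing or not accessible */
    return strcmp(sfs.f_fstypename, fs_type) == 0 ? 1 : 0;
#else
    (void)path;
    (void)fs_type;
    return -1;                           /* f_fstypename is a BSD / Mac OS X field */
#endif
}
```

An application would run this check once per volume and only attempt the Xsan-specific calls when it returns 1, falling back to ordinary file I/O otherwise.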