---
title: WWDC2004 Session 108
framework: wwdc
role: article
path: wwdc/wwdc2004-108
---

# WWDC2004 Session 108

## Transcript

Hi everyone, my name is Pat Dirks. I've been working on file systems and file servers on the Mac ever since the days when the most excellent Mac XL supplanted the 512 as the top of the heap, so it's been a while, and we've come a very, very long way.

We put together a talk to give you some highlights of things that you should be aware of when you're using the file system, mostly from an application perspective. We are going to go into a little bit of background, just enough to set the stage and lay an anchor for all these things to make sense. We'll talk a little bit about the BSD-level APIs but also about the higher-level frameworks. What we're going to focus on is how to make your applications go faster, basically how to make your I/O go faster, and how to work better overall with the system, particularly in light of some new technologies that we're introducing: extended attributes, ACLs, that sort of thing. We'll say a few words about monitoring other ongoing things in the system, and about common areas where applications often get bogged down doing things that are unnecessarily wasteful. So we'll try to hit all the highlights, and then we'll probably have lots of time for Q&A afterwards. We'll have the whole team up and you can ask anything you want.

So, a bit of background on the kernel and the BSD parts of the system. There are actually a number of different frameworks that people access the APIs through. There's the object-oriented Cocoa set of frameworks, there are the familiar Carbon APIs, and there are the low-level BSD APIs that are the foundation for all of them. The thing to bear in mind is that everything that is done in the system is done at some point through some BSD API, so if you don't mind getting down to the real nitty-gritty, the BSD API will let you do anything you want to do inside the system.

At the highest level is the virtual file system layer, and that provides a common set of system call
interfaces to all other parts of the system, as well as providing some common services like pathname translation, a certain amount of access checking, and NFS export services that are just common to all layers of the system. Beyond that, it uses an extensible interface to lower-level specific file system implementations, and it works with a buffer cache and VM, which I'll say a few more words about in a moment. The buffer cache, in I think it was Jaguar or maybe Panther, was changed from a separate buffer pool that sat beside the VM system, and competed with VM for who had the correct copy of something, to a unified buffer cache that coordinates with VM to make sure that there's only one copy of your data in the system at any one time. There's a set of code called UBC, the unified buffer cache, that basically coordinates who owns a given page at any given time and swaps the ownership between the two. So the important thing is: whatever there is of your data, there's only one copy of it floating around in the system.

Now, at the lowest level there's a very simple set of I/O calls: open, close, read, write, truncate, seek, just the bare necessities. And there's mmap, which we'll talk about at some more length later on. We have some basic metadata operations: you can open a directory for reading, you can read through it and get some information on the various items that are in the directory. We'll talk about that some more later too, because that's another area where you might want to be careful. stat is the basic call to get the metadata on an item, and then there are a few other calls.

We extended the basic BSD mechanism a bit to support some of the things that HFS had brought to Mac OS 9. When we first introduced Mac OS X, you couldn't quite port all of the existing Mac OS 9 functionality to the BSD APIs that I just mentioned, so we introduced a pair of calls, getattrlist and setattrlist, that let you do stat and chmod things in a slightly different, more extensible way, and a
new call, getdirentriesattr, that probably few people call directly, but it's used by the frameworks in the system to do efficient enumeration of items in the same way that GetCatalogInfo did. In addition, we introduced searchfs, which is an interesting API because it lets you search the entire catalog of a disk in more or less block order, so the performance of searchfs is dramatically faster than doing a complete tree traversal of every single directory in the file system. We added a few fcntl selectors to let you do things like preallocating space on a disk and other things like that. And new in the Tiger release, and we'll talk a bit more about that, are extended attributes and ACLs.

In general, although I said you can do everything at the lowest level in the BSD system, you're probably best off using the higher-level frameworks, Cocoa or Carbon, because the BSD level of APIs makes no attempt to hide the differences between the different file systems. If some file system doesn't support a certain feature, it's just not going to work on the BSD call, whereas frameworks like Carbon will provide uniform access to things like resource forks and even emulate them if necessary on file systems that don't directly support them. So in general you're best off using the highest-level API that you can use.

I'll say a bit more about extended attributes. Extended attributes are new in Tiger. The release that you have has only a very limited implementation of extended attributes; they're basically only supported on the HFS file system. But in Tiger, extended attributes will be supported across all file systems. They are, however, going to be fragile in much the same way that the resource fork has always been, and one of the things I hope you'll take away from this talk is that if you're manipulating files in the file system, if you're making copies of things, if you're transferring things around, you should be mindful that, like a resource fork, they may have
extended attributes associated with them, and you should be careful to preserve them.

While I'm on the subject of the resource fork, by the way: if your code still has any references to filename/rsrc, now is definitely the release to get rid of that. We've supported that for a while, but it's not the way to refer to things. If you want to, you can use ..namedfork/rsrc.

The biggest trick is always doing a safe save, and unfortunately exchange files, which was invented for that purpose in Mac OS 9, is not supported on all file systems or all protocols. Besides which, you're going to have to be careful in general to preserve extended attributes and ACLs as well. If you were here at two o'clock today, you heard the talk about ACLs and what to do about that. We have a few recommendations; alas, there is no completely safe atomic solution. The problem with exchange files, even on file systems where it is supported, is that it doesn't work on directories, so you're always going to have to be prepared to take a fallback option and use something like rename, for instance, to replace a file in place. We're going to try to make rename work in a way that preserves the extended attributes and ACLs of the target that you're renaming to, just to facilitate doing this kind of safe save operation.

Now, if you're creating new files based on a copy of an old one, you should look at using these new APIs in some form. The release that you have only has support in the BSD layer; they're not in the frameworks right now, so that's still to come. But the basic idea is that you should use the extended stat calls to get an opaque descriptor of the file's security information, and then use one of the new open calls to create a file on disk that atomically has all those ACL attributes set.

There's a new API being introduced that you should consider using to mark files as busy, so that other applications that use the same API can find out whether a file
is currently in the middle of being copied, for instance. There have been a variety of different hacks that people have used to do that, but now there's an official API, and the implementation of that uses extended attributes, so it's another reason to preserve extended attributes wherever possible.

If exchange files is supported and your document format is a single straight data file, the best way to do a safe save is to do an exchange files. But you're always going to have to have a fallback option, and the best thing you can do, if you've created the file correctly, is to just do a rename in place. If your document format is a bundle, if it's really a directory with files contained underneath, you can do one of two things. Either create a complete parallel directory in place, and then rename one out of the way and rename the other in; it's not atomic, but it'll work. The other option is to create the updated files somewhere inside the top-level directory, so they're invisible to the user, and then, using either exchange files or rename, switch them in place with the pieces that you're updating. Only you can decide, based on your application, which model makes most sense.

Now, some things to be aware of. Locking support is unfortunately not uniform across all file systems. It's good practice to lock files if you think there may be contention for them, but there will be file systems, either local or network file systems, that don't support that. Rename does not preserve file IDs. That was one of the big advantages that exchange files brought: if you had an alias to a file, exchange files would replace the data while leaving the file ID intact, and an alias would find it again. That's one of the appeals of using exchange files. Fortunately, alias data structures have some additional resolution mechanisms built in that will make it possible to find the file even if the file ID is lost because you used rename instead of exchange files, but exchange files
are still preferable. And as I mentioned, rename cannot atomically replace directories. In addition, one thing to be careful about is that not all network file systems allow you to rename files that are open, so if you have a choice about it, it's best to close the file first, or at least flush it, before you try replacing it. And on the subject of that, be careful, because on some file systems close may actually do a very significant amount of I/O. On a WebDAV file system, for instance, it's not until close time that the actual data is all transferred over to the server. There may also be systems that don't actually guarantee that your quota has been checked properly, for instance, until the file is actually closed and on disk. So you should try closing the file before you use it to replace data, and if you can't close it, at least call fsync on it to force a flush of the data, and check for errors on both of those before you allow the data to be replaced.

So, a word about file copying, and this is actually one of the shortest sections of the talk. We've had lots of requests over the years to allow people access to the Finder copy engine, and that's what we're introducing in Tiger. You'll get access, not directly to the Finder's engine, but to the same engine that the Finder uses, so you'll be able to do copies exactly the same way the Finder does. If you can use this new FSFileOperation API to transfer files around, moving or copying, that will be by far the best way to do it, because we'll guarantee that it does the best job possible on the various file systems involved. The API will let you do either synchronous or asynchronous copies, and you can set a callback routine to be called at a rate that you specify, so that you can do whatever animation you want to show the copy in progress or anything like that. It's very nice, and that's what the Tiger Finder is going to be using. So that's an example where the highest possible level of API is just ideal.

So, a few
words about performance. You may have more choices about how to do I/O in Mac OS X and in Tiger than you may have otherwise. There are a few caveats that apply generally. Obviously, the fewer I/Os you do the better, and the more you can aggregate I/Os into a few large I/Os the better. The kernel will to some extent coalesce operations that you do, to make fewer, larger operations out of a few single ones. It will do read-ahead caching and write-behind caching, so you may be picking up a benefit that you weren't even aware of.

There is something very specific, mostly on HFS, about zero fill. The UNIX file semantics specify that data that has not been written should read back as zeros, and the system goes out of its way to make sure that that is true even on file systems like HFS which don't actually support sparse file semantics. What that means is, if you take a file and create it and skip ahead 100K into the file and start writing there, and you try to read something out of the middle, the system will actually supply zeros for all those areas and write those on disk if necessary. If you close the file having only written that last little bit at 100K out there, it will fill the intermediate space on disk with zeros. So that's another example of a significant amount of I/O that you might incur when you do a close. A few more words about zero fill later, too.

Now, in general there are a couple of different ways that you can do I/O, and we'll talk about them all in turn. You can do standard buffered reads and writes. You can directly memory-map a file on disk, anywhere on disk, into your address space and access it that way. You can do uncached, unbuffered I/O, which is great if you're ripping through very large data sets that you have no reason to believe will be touched again later. And we're now introducing support for true asynchronous I/O in the kernel as well, and that's another option you may want to consider.

Now, the buffered read/write case: it's the most general of all the
mechanisms. You can transfer any amount of data to any place you want in your address space. There are some costs associated with that: the data is first transferred into the buffer cache and then explicitly copied out of the buffer cache into your application space, dirtying a page in the process. So you're incurring an extra copy and you're dirtying the page. But if you're reading or writing small files, and if you have reason to think that they might be touched again shortly after, or if you really need the flexibility of writing a small amount of data in a very particular place in your address space, this is the perfect mechanism to use. And if you're just doing reads and writes and doing nothing else, this is the mechanism you're using.

By contrast, if you're willing to give up control over the location of the data, you can use memory-mapped I/O. You can call mmap; you have to call the low-level BSD system call, there's no framework-level call for this. You can ask the system to map a file into your address space and it will just appear: you'll be given the address, you can access the data right there, and it will be paged in directly off the file on disk. If you're reading in a file, this is a great way to read it. If you don't care where the data ends up, you can just use your favorite byte-copy routine to read your file. There's only a single copy made of the data, because the system doesn't have to relocate it to the place where you requested it to be, and everything is left clean, so if the pages need to be reused later they can simply be tossed; they don't need to be written out to swap space. And although you can overwrite data that way, and it will be paged out to the file, you can't actually extend the file that way. That's the only limitation.

The big caveat on using memory-mapped I/O is that if you encounter any I/O errors, because, say, it was on some removable disk and the disk was removed, or it was on a network file and the network connection was broken or something,
it's as if you hit a bad RAM chip. I mean, you get a memory access error, so it's a little bit more severe than just getting an error back on an I/O transfer. You may want to keep that in mind and limit this type of operation to local, non-removable media. But with that caveat in mind, it may be an attractive way to do I/O, and it's very clean.

Some of the benefits of doing that kind of I/O can also be had by doing direct, uncached I/O. You can call fcntl to request that I/O to a certain file descriptor be done uncached, and the system will take advantage as much as possible of the alignment of the data to do the most optimal transfer direct from disk. The reason I recommend using 4K-aligned buffers, for instance, is that where possible the system will DMA the data directly from the disk controller into your memory, and it will never be copied along the way. The pages are dirtied, but it doesn't blow through your cache. So if the system cache is filled with pages from your application, or pages from other files that you've touched, or pages from the disk structures that you're hitting, you don't end up forcing them all out of memory just because you read some multi-megabyte data file. So it's something to think about, and there are ways that you can do uncached I/O from the Carbon framework, for instance; I'll show that in a minute.

Async I/O is a standard API that we're introducing support for in Tiger. It's a complete implementation of the standard POSIX async I/O. You can initiate async I/O requests, and you can get notified by a variety of mechanisms. The I/Os are handed down to the kernel in parallel, so in theory they could proceed in parallel to some extent. And if you're calling the Carbon PBReadForkAsync or PBWriteForkAsync, you can specify explicitly that you want the system to do concurrent async I/O. By default the Carbon framework actually emulates async I/O by having a separate thread that is introduced by the system to take care of the I/Os
while the rest of your process is allowed to proceed, and so things are essentially single-threaded on the async side. But if you want, you can now specify that the system should do true concurrent asynchronous I/O, and it will get done that way.

Caching is a subtle trade-off that you should be very careful about. It seems, and it was true in earlier versions of the system, that the more things you can keep in memory, the faster your application will go. Unfortunately, in a system like Mac OS X, where everything is backed by VM, that's not necessarily so. Just the cost of expanding your memory footprint and introducing extra memory pressure in the system, having to possibly evict pages and write them out to disk to make room for the things you're about to cache, then reading in the information, and then possibly having to flush that dirtied page out to swap again when it isn't actually being used, may really drive up the cost of caching beyond the point where reading the data off disk in the first place, or possibly keeping it mapped, would have been a far more efficient trade-off. Besides that, having something cached is no true guarantee that anything will be really fast: the page may still disappear, and it may still need to be read in from swap. So it's not as if you're somehow guaranteed real-time access to anything, unless you're actually going to wire a page down or something like that, which has performance trade-offs all its own. So be very careful when you think about building up large in-memory caches of something, because the additional memory pressure you introduce may make the system slower, not faster.

As I said about zero fill, the system goes out of its way to fill pages with zeros if necessary. It also tries to defer that as much as possible; it tracks in memory what is actually being written. So it's not as simple as saying that skipping around and doing I/O in a non-sequential way will guarantee that you end up doing writes of zeros, but it does introduce that
possibility, and if you can possibly write your files out sequentially, you'll be that much better off. You can use Allocate instead of SetEOF to allocate space without forcing zero fills, because SetEOF actually moves the EOF of the file out on disk and will force zero fills of areas that you don't actually write as part of that write. Allocate simply reserves the space in advance. Allocate is not supported on all file systems; you should just ignore any errors you get, because the write will allocate the space as it goes anyway. But calling Allocate is a good idea, and it's a better idea than calling SetEOF.

Finally, if the file you're writing is just a scratch file, and you're writing it randomly in different places so you're going to be leaving lots of holes, truncate the file to zero before you close it, because that will preclude the system from writing any zeros out to disk for data that you're going to just throw away anyway. So if you have a scratch file and you're going to delete it as soon as you close it, truncate it to zero first, and you'll skip any zero fill that the system might otherwise have had to introduce.

A few words about directory enumeration, which is another area where apps can spend lots of time. You can call the raw BSD readdir and stat calls on a full path. You can use GetCatalogInfo in the Carbon framework to get the equivalent of the information out of stat. There are new calls that were introduced: getdirentriesattr, which I referred to earlier, is the BSD-level call, and GetCatalogInfoBulk is the Carbon equivalent of that call; it will enumerate a directory and collect the metadata you request as it goes. The big way that you can save there is that the call has a field, whichInfo, which specifies which fields in the metadata you're actually interested in. For some file systems that would otherwise end up emulating some of the data, or going through contortions to generate or derive some version of the data, if you don't
actually care about some of the information, you can indicate that by setting whichInfo carefully. If you set that to as small a set as possible, the system will only collect as much metadata as is absolutely necessary to satisfy the request.

Now, finally, searchfs has a Carbon equivalent, CatalogSearch. It's an interesting mechanism: if you're going to do a very large search of essentially the whole disk anyway, it may be a far better way of searching the data. It's surprisingly fast. In about three to four seconds you can search an enormously large hard disk's catalog in its entirety, and you can search for matches based on a whole slew of metadata parameters, or filename substrings, or things like that. So if you're scanning a volume, use searchfs; don't do a tree traversal of the whole thing starting from the root. So, something to consider.

So overall, think carefully about how you access data, and make sure that you choose an appropriate mechanism for the type of access that you're doing, or perhaps change the way you access or write data to better suit the mechanism. And avoid zero fill on file systems like HFS.

There's a great quote that says more sins are committed in the name of optimization, however illusory, than in actual code, and that's very true. If you're going to set about optimizing your application, analyze it first, and make sure you find the true hotspots in your application before you go wild introducing some new caching scheme or something because you're convinced that this part of the code is really slow, and instead end up creating a much larger memory footprint that costs you in performance when it should have gained you.

If you're porting an app from another system, this is a good time to consider how you're actually hitting the disk and the file system. Some things that you may think are cheap are in fact expensive, and some things that would have been expensive on other systems are in fact quite cheap, so it's a good time to
revisit some of the fundamental assumptions that you may have about your data. And speaking of assumptions, don't make assumptions about the speed of accessing some of these things, because innocent things like preferences could easily be located across the net on an NFS server that may have crashed a minute ago, for instance, and what you think is a very quick access to refresh your preferences could potentially take a long time. So be careful about what you assume.

Now I'll say a few words about security and ACLs, which are new in the Tiger release. You're by now probably familiar with the standard UNIX permissions model that Mac OS X has supported all this time: there's a single owner, a single group, and three categories of access for each one: read, write, and execute. There's a slight wrinkle in that we allow ignoring who actually owns a file on removable media, and we do that by default, but this is the basic model.

Access control lists change all that. Instead of having one owner and one group and one "others" category, you can invent as many categories as you want, and you can build up a large list of ACEs, access control entries, that specify for individual categories what they're allowed to do or what they're not allowed to do. The system takes the whole list into account before falling back to the standard POSIX permissions. So it's very powerful, and it lets you do some of the things that AFP servers, for instance, let you do, where owners and groups can each be associated with their own set of access rights, as owner or as group.

If you were at the two o'clock ACLs talk, you saw that the group mechanism has been greatly expanded. The system is no longer limited to a fixed sixteen groups per user. It has given up on the idea of trying to pre-enumerate the groups that every user belongs to, and instead there is a very flexible user-level group membership daemon that the system consults to determine which users are members of which groups dynamically
and caches those results. As a result, there are things possible that were not easy to do before, such as having groups that contain other groups, for instance, and groups that are completely determined over the network in a very flexible way. It's great stuff.

The ACL API is based on the POSIX standard proposal that was not adopted and has been withdrawn, but it's very much based on that. The actual permissions, rather than being strictly the POSIX read, write, execute model, are a much more flexible, much more fine-grained set of permissions that, for instance, separate control over metadata and extended attributes from control over the data of a file. So it will be conceivable, for instance, to have a set of images, photos, that are read-only documents whose metadata you can control, and set your own keywords on, and things like that. So what you can and cannot do has become much more flexible.

Besides more flexible control over a single file, there's also more flexible control over the metadata and the directory structures themselves. There are ways that you can inherit access control information from a parent to a file that gets created. There is separate control over the ability to control items within a directory without regard for the actual controls on the items themselves. A lot of new stuff there, and there's a session tomorrow at five o'clock, not sure where, about the extended metadata APIs. There was an ACL session today at two o'clock. I'll have references to all those later.

So finally, notification is another area where an app can easily spend lots of cycles, perhaps unnecessarily. The obvious way you might go about seeing whether something changed, if you have a folder that people can drop plug-ins into or something like that, is to periodically poll and see if the mod date of that directory has changed. That will work, but you're spending time even when nothing's happening in the system. One improvement on that is the whole FNSubscribe/FNNotify mechanism that's
been around, but a much nicer solution that we introduced support for is kqueues, and I'll say a few words about kqueues. The basic model of kqueues was derived from the observation that applications spending lots of time in basically a select loop, or poll (the problems are not fundamentally different between the two), are very wasteful just in the system call overhead to do those things. So kqueues were introduced to give a model where, instead of specifying the set of file descriptors that you're interested in on the select call itself, you create a kqueue, and you're given back a file descriptor like an open file would be. Then you call kevent to specify the set of file descriptors that you're interested in I/O happening on, and then your process is put to sleep waiting for anything to happen on any of those file descriptors. When it wakes up, it's told something has happened on this file descriptor. So just a basic select loop mechanism is much more efficient that way.

We've implemented a version source-compatible with the FreeBSD version of kqueues, but we've also extended the mechanism a bit by introducing additional event filter classes. One of the classes is a Mach port filter class that will tell you whether some message has arrived on a Mach port, and another one tells you about things happening in the system, for instance whether file systems get mounted or unmounted anywhere in the system. If one of those events happens, your kqueue gets an event delivered to it.

The one thing to keep in mind about kqueues, and part of the reason why they're so efficient in the kernel, is that it's not an actual queue of individual discrete events being notified to you, but rather a single notification that some condition has come to be true in the system. So you wouldn't get ten little events saying three bytes are available on your file descriptor, five more are available on this descriptor, ten more; you would get a single event that says
there's data available on this file descriptor. So that's something to be mindful of in the model. It's not supported on all volume formats, because the implementation does require support inside the file system itself. And if you are a client of a network file server, you should be mindful that you may get notifications, but you will only get them about the local events going out to that file server, not necessarily all the other things that other clients of that server are doing on the server itself. So those are two things to keep in mind.

When you open a file descriptor on a directory, for instance, to request notifications about that directory, there is a flag you should set, O_EVTONLY, which says that this file descriptor is going to be used strictly for event delivery, because that will let the system correctly decide at unmount time whether a file system is truly busy or not. So, to keep file descriptors that are open solely for the purposes of notification from preventing you from unmounting a file system, if you set O_EVTONLY on the open, the file system will let the unmount go ahead anyway. So O_EVTONLY is one to be very mindful of.

Overall, unfortunately, despite the best efforts of the higher-level frameworks, there are lots of capabilities that are very specific to individual file systems, and if you rely on some particular behavior in the system, you should probably check beforehand whether the particular file system you're operating on supports that. There's a variety of ways that that can be done. There are getattrlist variants that will tell you something about the capabilities of a certain volume format; at the Carbon level there are the GetVolParms calls. But be mindful of it, and either test ahead of time or, even better, be prepared with a fallback option in case the behavior you need is not supported by the particular file system. With that caveat, make use of all the file system's special capabilities that you want.

A few words about
some of the tools that you all have available. One that I find myself using on at least a daily basis is fs_usage. It's a great tool that lets you monitor all the file system activity in the system. You can restrict it to a particular application, or you can monitor everything that's happening in the whole system, if you want to find, for instance, who it is that's writing a certain file or creating a certain something. You can find out the file system call that's made, the process that made it, and what the results of the call were. You can see errors happening: if some application is throwing up its hands without any clear indication of what's going wrong, you can run fs_usage and see what kind of file system errors it's getting. It's a great tool. sc_usage does the same thing for other system calls. The features have been somewhat expanded in Tiger since the last release you may be familiar with, so it's worth checking out the -f option. And if you run with the debug versions of the system frameworks, you'll get tracing at that level, so you'll find out not just that a stat call is being made, but that the application is calling GetCatalogInfo, which is causing the stat call to be made. So it's good stuff.

top is another great tool to get a quick look at what's going on in the system at a very high level. You see all the running processes; you'll see who's using CPU time and who's making system calls, because you can see the context switches happening. There's some summary information that will tell you the paging rates in the system, the number of I/Os that are happening, how busy the system is. It's great for just sort of getting a feel for, hey, the system's a little sluggish, what's going on? And if you find the app that's pegged at 99% of the CPU time, you've got your suspect to go look at.

time is a great command-line tool that will just count I/Os and tell you how much I/O is done, how much user time, and how much system time is taken as part of executing a certain process, and if you're
setting about optimizing your application, monitoring both of those is interesting to do. sample is great for looking at where a process is spending its time. There are more powerful profiling tools available, like Shark, which is part of the CHUD tools that you can install, but sample, especially when you think the process is stuck, is a great quick command-line tool to get a look at what the current stack frame is in another process. And finally, lsof will tell you about all the open files in the system. So if you can't eject a volume because some file is busy, lsof can tell you who has it open, or if you want to find out what files an application is keeping open, you can find out through lsof.

Those are just some of the tools. There are specific sessions that were devoted to performance analysis and profiling tools that are very much worth looking into. But fs_usage is one of my favorites, and Shark is definitely worth checking out; I think there's a session on Friday devoted just to Shark, for instance, so check it out.

For more information in general, you can contact Jason Yeo, and there are also some resources on the DVD and online that will tell you more about the particular file systems or about software design guidelines in general.
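
The rename-based fallback for a safe save described in the talk (write the new data to a separate file, flush it, close it, check both for errors, and only then rename it over the original) could be sketched roughly as follows. This is a minimal sketch using only standard POSIX calls, not code from the session: the helper name `safe_save` and the `.tmp.XXXXXX` suffix are assumptions made for illustration, and the sketch does not copy extended attributes, ACLs, or permissions from the old file to the new one.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from the talk): write `len` bytes to a temporary
 * file next to `path`, flush and close it, and only then rename it over
 * `path`. Returns 0 on success, -1 on failure with errno set. */
int safe_save(const char *path, const void *data, size_t len)
{
    char tmp[1024];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = mkstemp(tmp);          /* creates the file with mode 0600 */
    if (fd < 0)
        return -1;

    const char *p = data;
    size_t left = len;
    while (left > 0) {              /* write() may be partial; loop until done */
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    if (fsync(fd) != 0)             /* force the data out and check for errors */
        goto fail;
    if (close(fd) != 0) {           /* close can also report deferred I/O errors */
        fd = -1;
        goto fail;
    }
    fd = -1;

    if (rename(tmp, path) != 0)     /* replace the old file in one step */
        goto fail;
    return 0;

fail:
    {
        int saved = errno;          /* keep the original error for the caller */
        if (fd >= 0)
            close(fd);
        unlink(tmp);
        errno = saved;
    }
    return -1;
}
```

A caller would use it as `safe_save("document.dat", buffer, size)` and treat a return of -1 as "the old file is still intact". Closing before the rename follows the talk's advice that some network file systems refuse to rename open files and that close itself may perform significant I/O.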

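The memory-mapped read path from the performance section might look something like the sketch below. It assumes only the standard POSIX `mmap` interface; the helper name `sum_file_bytes` is hypothetical and simply stands in for "touch every byte of the file". As the talk warns, an I/O error on a mapped file shows up as a memory access fault rather than an error return, so a sketch like this is best limited to local, non-removable media.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from the talk): map `path` read-only and store
 * the sum of its bytes in *sum. Returns 0 on success, -1 on failure. */
int sum_file_bytes(const char *path, unsigned long *sum)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    *sum = 0;
    if (st.st_size == 0) {          /* a zero-length mapping is an error */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    void *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (base == MAP_FAILED)
        return -1;

    /* Pages are faulted in straight from the file as they are touched;
     * there is no explicit read() into a separate buffer. */
    const unsigned char *bytes = base;
    for (size_t i = 0; i < len; i++)
        *sum += bytes[i];

    munmap(base, len);
    return 0;
}
```

Because the pages are only read, they stay clean and can be discarded under memory pressure instead of being written to swap, which is the property the talk highlights.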