WWDC2001 Session 116
Transcript
Thank you, and welcome to Session 116, Mac OS File Systems. The Mac OS has supported multiple file systems for a long time now, and now, with the UNIX foundation in OS X, we support a whole plethora of file systems. To tell you all about those different file systems and how you should be using them, I'd like to introduce the manager of the Core OS file systems group, Clark Warner.

I'm really glad to see all of you here. I know it's early, and I appreciate your coming. I hope we're going to have some very useful information for you about the Mac OS X file system. We'll start with our welcome and a little idea of some of the stuff that we're going to cover today. We're not going to talk as much about futures; you've probably heard that as a theme throughout the Developers Conference. The present is so much more exciting this year. And we're not going to talk as much about file system internals, because we know that a lot of you are now in the process of bringing your apps over to Mac OS X. So we're going to concentrate on some of the things that your apps need to be prepared for when they're using the file system. As Jason mentioned, we have a number of different file systems in Mac OS X, and they do present some occasional wrinkles. We're also going to give you some tips on how to improve the performance of your app, especially with regard to its use of the file system. But we will talk a little bit about some of the key issues in building file systems yourself to add to Mac OS X.

If you went to the Darwin overview session, or probably any of the Core OS sessions, you've seen this chart already. This is the key "you are here" graphic. If we were on an airplane now, we'd be saying, "This is San Jose; if you're going to Hawaii, get off now." We are part of the BSD kernel. Inside of the file system, here's a blow-up of what the file system looks like internally. We support, basically, at the Core OS level, the BSD, Berkeley Standard Software
Distribution, system calls, with some extensions. Inside of the file system there is a big switch we call the virtual file system layer, which separates the part of the file system that is dependent on the underlying volume format or network protocol from the part of the file system that is independent. So the stuff above the VFS layer is independent, the stuff below is dependent, and that's why you see our list of file systems underneath the virtual file system switch: UFS, HFS, NFS. We'll talk to you about all the various file systems that are supported in Mac OS X.

Here's the outline of today's talk. There'll be something of a status update, basically an indication of the file systems that we ship. We're going to do a demo of a couple of the new ones. We're going to talk only briefly about some of the different file system interfaces; we've covered that for a number of years, and there are a lot of sessions describing the application frameworks for use in Mac OS X. But we will spend a fair amount of time talking about the differences between the various file systems that Mac OS X supports and, again, how that might affect your application development. We'll talk about security, because that's been a large issue for a lot of folks. Mac OS X is a multi-user system with per-file permissions, and in that way it differs greatly from Mac OS 9, and you need to be ready for permissions errors, potentially, in your applications. We're going to talk about performance considerations, as I mentioned, and we'll talk a little bit about building new file systems.

First, the status slide. I think this is my third or fourth Worldwide Developers Conference session talking about the interesting things we were going to do in Mac OS X and the recent things we had done in interim build A, B, and so forth. This is the first time I get to say this. I am not above cheap applause. This is why I love this crowd. All right, so in Mac OS X we have three primary file systems, and when we say primary, what we mean is
they're fully read-and-write file systems and we can boot and root off of them. So we can boot and root off of the Mac OS Extended format file system, best known to those who know and love it as HFS+. We can boot and root off of UFS, the UNIX file system that we support, based on the Berkeley Fast File System. And we can boot and root off of NFS; in fact, we do that internally all the time for installs.

We also have some read/write file systems that we don't boot and root off of but that are otherwise first-class citizens. That includes the Mac OS Standard format, otherwise known as HFS, for legacy data that you may have on your Mac; the Apple Filing Protocol, a client that was delivered by the server team into Mac OS X; support for an MS-DOS file system, and specifically we mean FAT16 and FAT32 here, which we did largely for digital cameras and so forth, but also Zip drives and removable media coming from your DOS machines; and WebDAV, and we'll talk about WebDAV in more detail a little bit later on in the talk.

There are also some read-only file systems supported in Mac OS X, including ISO 9660, which is used on lots of CD formats, especially to interchange between Mac and Windows. We support the Universal Disk Format; actually, I don't think it's called that anymore, I think they just use the initials UDF, sort of like KFC and Kentucky Fried Chicken. And CDDAFS, a file system written by the CPU software group to support CD audio files, CD audio drives, CD audio discs rather, for music. ISO 9660, I should mention, we can also boot and root from; I didn't mention it in the primary file systems because it's not read and write.

So I'm going to do a demo. I'd like to bring up demo machine number one, and I'd also like to bring up my lovely and talented assistant, Scott Roberts. You might have seen a demo similar to this in the keynote: Avie got his digital camera up and took a picture of the audience. I'm going to have Scott do the same thing here. Thank you, Scott. Now, it turns out I'm not a vice president, and as
a result I don't make as much money as Avie Tevanian, and so I can't afford a really nice, fancy digital camera that has USB hot plug and so forth and so on. I have this camera that was given to me as a gift by a friend of mine, actually a Kodak DC210 Plus. The nice thing is, although it doesn't have USB connectivity and all that, it does have this little CompactFlash card that it uses for memory storage, and we happen to have a SanDisk reader here attached to our demo machine. So I'm going to put this little card in here, let's pray a little bit, and you'll see the image come up on the system. Now, we're doing something differently than Avie did, because we're not here to show you Image Capture or any of the fancy camera applications. I just want you to know that this little CompactFlash card is formatted as an MS-DOS disk, and so we're looking at an MS-DOS volume that's just loaded on our desktop. If I bring up the Preview application and go here to the image that Scott just took and open it up, there's a picture of you folks. I must have said something wrong, I don't know what.

OK, for our other demo I wanted to show you the WebDAV file system. Over here on my other demo machine, I've pre-configured it as a web server, and I actually changed the configuration file so that it would run DAV, which actually ships in Mac OS X as part of our Apache installation. So host 2.1 62 is running WebDAV, and I'm going to bring up Internet Explorer here, which I should have pre-launched; sorry to waste your time, folks. I'm going to type in, oops, the URL, and you can see we have our little Apache page. Now, since I'm bored with looking at things in web browsers, I'm going to mount this particular website as if it were a volume on my desktop. I do that by hitting Command-K, going to the Connect to Server dialog, and typing in the URL here instead. And now you notice I have the Apache web site actually mounted as a volume on my desktop. Well, why is that
interesting? Let me show you one thing that we can do. We took a movie, actually of last year's file system session, and we edited it and made a small movie out of it and put it on the web server. Now, instead of going to the browser and trying to deal with the QuickTime plug-ins and so forth, I'm just going to double-click that movie and we'll show it for you now. Oops, let me back it up a little bit here. Thank you. The stage police are watching me, and I've heard from my spies that when I turn around to jump on the stage, something bad's going to happen to me. I have to be careful. I've noticed that the stage is higher this year than last, so I think I'm ready. I'm getting older; I can't do this kind of stuff every year.

OK, let me show you one more thing that I think is pretty interesting. I'm going to go into BBEdit here. Now, I haven't actually completely rehearsed this; I was fooling around with it before the show, but I thought it would be kind of an interesting thing to do. So I'm going to bring up BBEdit and open up a file on the WebDAV server, and it's going to be the English HTML index file. I see some text here that says, let's see if you can see this, it means the installation of the Apache server went OK. I don't like that text. The text I like is "File systems rule." I like that much better. Now I'm going to go to the web server here and refresh, and you can see "File systems rule" up here at the top. So an interesting way to edit your website is to enable DAV on it, and then you can just use your favorite editing tools and do whatever you like inside your favorite editing tool. When you save, it's automatically populated right back up on the server. You know, I had to have a plant for that applause.

OK, let me talk just briefly about some of the file system interfaces in Mac OS X. This is the little file system interface chart. You can see the three major application environments, Classic, Carbon, and Cocoa, on the top. They all have their own file manager or file object
interfaces, but they all come through the BSD layer, and we've extended the BSD layer to allow access to file system metadata that's not typically available in UNIX, especially catalog information, Finder info, the file type and creator type, and things that you find on HFS+. Underneath all of that is our virtual file system layer, which I mentioned before, and that's where additional file systems that you might develop in Mac OS X, or stacks, would layer into our system.

I won't go into great detail here, but suffice it to say the Carbon File Manager is still available to you. The Mac OS file system interfaces are carried forward there. Access by volume reference number and directory ID is still available. And Carbon does an interesting thing: Carbon provides HFS-style semantics on file systems that don't normally support them. So if you're using UFS but your app is using Carbon, you'll still see resource forks and you'll still see file IDs. The file IDs don't last across mounts; they're done in memory and kept for you only while the mount exists, but nonetheless they're present if you do a call and you want to see them. So you're somewhat insulated, if you're using Carbon, from some of the differences across file systems. We also, of course, have the Cocoa environment, with an object-oriented API and some file objects that you can use. And as I mentioned, the Berkeley UNIX interfaces are available and can be used by your app as well, even in a mixed environment with the other application frameworks.

We want to just briefly mention a couple of the new calls that we added to the BSD layer for accessing HFS file metadata and the sorts of things that aren't typically available under UNIX. The two major ones are getattrlist and setattrlist, which roughly stand for "get attribute list" and "set attribute list." These are flexible calls designed to retrieve various types of metadata, in various different formats, back to your application. Normally you wouldn't call these; you'd probably
call GetCatInfo and SetCatInfo in Carbon, but they're available to you if you need them; they are in our system. We also implemented a call named searchfs, whose job is to do fast catalog searching. It's supported in HFS+, although most file systems don't support it, and it was designed to support the PBCatSearch functionality in Carbon. We also have a call for exchanging data between files. Those of you who've done Carbon apps before are probably familiar with the ExchangeFiles call; we created a BSD-level call named exchangedata to do that same atomic transfer of data between two files. And finally, we've added some options to fcntl, which is a standard UNIX file system call, but we have some extensions to allow behavior like allocating storage in advance of the EOF, that is, space that's not part of your file per se but is part of the file's allocated storage, so the file can be extended later without additional allocation.

Now we're at the file system differences part of the talk, and I'm going to grab a little cup of water here, because this is going to be a little longer and this stuff is important. There are, as Jason mentioned, many file systems supported in Mac OS X, and while Mac OS X tries to be as file system agnostic as possible, the underlying volume formats and the underlying network protocols do impact the behavior of file systems. In some ways it's just unavoidable, and your app may need to be ready for it. So we wanted to give you an idea of some of the differences that you'll need to watch out for, kind of a warning about some stuff.

The first set is naming differences. Our file systems often behave differently with regard to names, and two of the biggest issues are the name lengths that the file system will support and whether they are case sensitive or case insensitive. As a lot of you know, HFS+ is a case-insensitive file system. The Mac has historically worked that way, and people are
accustomed to it. But the UNIX file system is historically case sensitive, and people who are used to using a UNIX environment are accustomed to that. We wanted people to have an option in what kind of file system they used, because some of these differences are actually valued by people. UFS, for example, supports holes, so you can have very large files with not so much data stored, and they don't take up much room on your volume. HFS+ doesn't. On the other hand, HFS+ supports full Unicode characters. Anyway, we'll talk about some of those differences as we go.

Access by persistent ID or path: all of our file systems support path-based access. Some of our file systems also support access by identifier, and HFS+, of course, is among the file systems that support access by identifier.

We also have a call that allows for an error to be returned if you try to delete a file while it's held open. Deleting an open file has historically been a no-no on the Mac, and it's historically returned an error. However, unlinking a file, which is the delete call on UNIX, typically does not return an error even if the file is open, and we support those semantics. But we wanted Carbon to be able to get an error back if someone tried to delete a file that was open, because apps are expecting that behavior. So we built a special call. We nicknamed it delete, I think it's system call 227, and it just checks for a reference and returns an error.

Likewise, some of our file systems have full Unicode support, some use local bytes, some UTF-8. And some of our file systems support permissions and some don't. What I mean by that is the volume format or the network protocol will have provisions for permissions data to be stored and/or transmitted. Some file systems, like HFS for example, the Mac OS Standard format, have no storage for permissions information, and so we will use default permissions for the entire volume, often substituting the logged-in user as the owner for all of the files on that
volume, with a default permissions mask. Some of our file systems support hard links, which is the ability to create separate nodes in the file system pointing to the same data, where the nodes are essentially equal, not an alias from one to the actual object. Some of our file systems have storage for catalog information; some of them don't. Some of our file systems are multi-fork; some of them aren't.

I wanted to talk to you about WebDAV in particular, because WebDAV is a really good way of highlighting some of the odd differences you might see between file systems. It's interesting that we have this sort of file system architecture in Mac OS X, because it allows us to do things like mount a web server as a file system, but there are going to be some gotchas. The WebDAV protocol does not have any sort of dates besides modification dates, so if you wish to get the access date of a file on the server, it's actually impossible; we have to sort of fake that information. So that's one classic difference. Inode numbers aren't a concept that is supported in the WebDAV protocol either.

WebDAV, I should mention, stands for Web-based Distributed Authoring and Versioning. The reason it came into existence was to allow collaborative authoring on the web. I think its original envisioning was for people who are doing development in a web-based authoring tool to be able to move things to and from the server and affect the files on the server. It wasn't originally a file system protocol; that's something Apple decided we could do, because the protocol added enough support in the way of a consistent hierarchical namespace, some synchronization with locking, and property management. But those properties do not include inode numbers, and so, very much like Carbon and file IDs, when we see a file in WebDAV we generate an inode number, and we remember that inode number for the life of your mount. But if you unmount the WebDAV volume and mount it again, the inode numbers for
files will not be the same. Also, we can't set live properties in WebDAV through the protocol. Live properties are the ones that are actually supported by the server. There's also a notion of dead properties in WebDAV, where you can make up a property and store a value in it. But real live properties, like, for example, the access time, or rather the modification time, are actually on the server, and we want to respect those and return those to you, but the servers will not let us change them. So you can't do a set mod time on WebDAV and have it work. We'll have that silently fail, to keep apps from keeling over, but be advised that you can see a silent failure of a set mod time.

The security model for WebDAV is entirely different from what you'd expect from a file system. It's HTTP security. It's basic authentication for us, generally, which means that you try a request, and if the server doesn't like you, it gives you back a message that says authorization denied, and it's your job to find out the user's username and password and try again, sending it across. But there is no way, except by attempting an operation, to determine whether the user is going to be able to do that operation or not. If you send a PUT across to the server, which is the mechanism for taking a file and moving it up, you don't know if it's going to succeed or fail. There isn't a pre-flight call to use; you just have to do a PUT. So what happens in WebDAV is, if we get an authorization error, the daemon that supports the system puts up a dialog box that says, "Who are you? Please type your user name and password in," and then we'll send that across to the server. We'll keep doing that until the user gets it right, it times out after about five minutes, or the user hits Cancel. If the user hits Cancel, an EACCES error comes back from the file system.

Likewise, unlike AFP or NFS, which are typically run over local area networks and are therefore usually reasonably fast, maybe
not compared to local volumes but in the absolute sense, WebDAV can be quite slow. It may be running over a 28K modem link; you never know, because we're talking about an Internet file system, after all.

So, we're not going to go through all of the items on this chart. I just wanted to scare you a little bit, to give you an idea of how some of these file systems actually differ. Classic example: HFS+ supports privileges, it has storage in the volume format for privilege information; it supports the ability to get an error back when you delete an open file; it supports access by ID. MS-DOS supports none of these things. WebDAV supports none of these things either. NFS is sort of a mix.

Same story with naming differences. Some file systems are case sensitive, like UFS; some are case insensitive, like HFS+. Some support Unicode fully and some don't. We make a particular mention of the Unicode characteristic. HFS+ supports Unicode names in a canonical form where the characters are decomposed. It's possible in Unicode to have, say, an e with an accent represented as one character, or as e followed by an accent character. There aren't that many decomposed characters, but there are some. We always store all our file names decomposed on the volume, so that there will only be one element in a directory that looks the same to the user, and so that we'll easily be able to do name comparisons. But not all file systems do this. UFS does not; UFS does not interpret the bits, and we wind up storing things on UFS as UTF-8 characters. So on UFS you actually could have two names that look identical to the user but are slightly different in their byte representation. What this means for your application, though, is that if you have a composed character in a name that you send to a create call, when you look at it again through a directory listing, on HFS+ it's going to be different, and on UFS it will be precisely the same.

So, having said that, let me give you some tips as to how you can handle some of
the specific differences we've talked about today in your application.

Number one: be consistent in your use of case internally. We found with the Macintosh Application Environment, which was a Mac environment that ran on UNIX operating systems, that lots of Mac apps would keel over when MAE was running on a case-sensitive file system and that characteristic bled through. The reason was that they would have files, like preferences, that they would open in one part of their app with a capital P and in another part of their app with a small p, and those wouldn't be the same, and that would confuse the app. On HFS+ they would be the same; on UFS they wouldn't be. You'd either see a different file or you'd get a does-not-exist error when you tried to access the file by a name that differed only in case. So make sure that in your app, when you're referencing a hard-coded file, if you do that, you're using the same case in all instances.

Always use decomposed names in your application. That way you'll never be surprised by a name being slightly different on the way back out than it was on the way into the file system when you created it.

Be prepared for access errors at random times. As I mentioned, the WebDAV file system has a very odd permissions model, and we're going to do things like put a file across, which may happen in an fsync or a flush files operation or on a close operation, and we're going to discover at that time, and not in advance, that the operation isn't permitted. If the user doesn't have a username and password that allows them to do it, or if the server administrator has cut their access, you can get back an EACCES error on a close call or on a flush files call, which is something that probably hasn't ever happened to you before. So be prepared for access errors at strange times, and you may want to be able to put up a dialog that says access denied. You're probably going to see them, if you're running a Carbon app, as an AFP permissions error, because that's how Carbon maps our access-denied
error codes.

Also, do not rely on inode numbers, and this is also true for file IDs, across mounts. If the file system is unmounted and remounted, those numbers change: all the inode storage on a WebDAV volume, and all the file ID storage on a UFS volume, is kept in memory in big tables, and there's no effort made to make that persistent across different mounts.

We have one way to help you deal with some of these differences, and that is the pathconf system call. pathconf is designed to give you characteristic information about the file system you're running on. It just takes a path, and you give it a selector that says you'd like to know if the file system is case sensitive, or how long the names are that it supports. Those are the only two selectors that we support right now in HFS+ on pathconf. Perhaps someday we'll expand that list, but you can use this call, especially on HFS+, to know that you're on a case-insensitive file system.

So I'm going to now bring up Pat Dirks, the Core OS file systems technical lead, to talk to you about some security issues and some performance issues in Mac OS X.

Thanks Ari. Well, the first thing to realize is that in Mac OS X it's a whole new world. It is multi-user to the core. There are permissions everywhere, and the core kernel enforces those permissions. There's no path around it, or some access path that's going to be different and not affected by it. The whole system is fundamentally multi-user. The permissions in the system are the standard UNIX permissions. If you're familiar with those, the next few slides are just going to be review for you. Hold on, there are a few gotchas in the permission handling of HFS, but if you are familiar with UNIX, you should be very comfortable with the permissions model on our system, and we'll see that diagram again in a moment.

The permissions, for those of you who may not be familiar with UNIX, are in some ways similar to AppleShare. AppleShare's permissions, when we were designing the AFP
protocol, were based on the UNIX permissions model, and we made a few changes from there to allow them to work on a folder-only basis, but they're fundamentally inspired by the UNIX model. So instead of See Files, See Folders, and Make Changes, you have read, write, and execute, and you have that for files and you have that for directories. The catch is that it applies to files as well as folders. On an AppleShare file server, the only permissions you ever had to worry about were the permissions you had on a given folder. In Mac OS X you have to worry about the permissions on individual files as well. The other difference from AFP is that in AFP you can have separate permissions for the world, the group of a particular folder, and the owner of the folder, and whichever categories you fit in, you got those rights. So everybody started out with the Everyone permissions, and then if you were part of the group you also got the group's permissions, and if you were the owner of the object you also got the owner's permissions. In Mac OS X there's only one group that is matched: if you're the owner, you get exactly the owner's permissions; if you're in the group, you get exactly the group's permissions; and if you're everyone else, then you get the other permissions.

So those UNIX permissions consist of an owner ID that is saved with the object, a group ID that is saved with the object, a set of permission bits, which you see there, and a few extra bits that are not divided up into separate categories for owner, group, and other. We'll cover in a moment what exactly those bits mean in different cases, but that's basically read/write/execute in three groups, plus three special bits, and some flags that we'll cover in a moment as well. So every user is categorized in one of three possible groups: either the owner of the object, the group that is associated with the object, or everybody else. And whichever group is most specific determines the access you get, as I said, unlike AppleShare.
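
The matching rule just described, where exactly one of owner, group, or other applies, can be sketched in a few lines of Python. This is a minimal illustration and not the kernel's implementation: the function names `effective_class` and `allowed` are made up for this example, the user and group IDs are arbitrary, and the sketch deliberately ignores the superuser, the special bits, and the flags.

```python
import stat

# Permission bits for each class, in (read, write, execute) order.
CLASS_BITS = {
    "owner": (stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
    "group": (stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP),
    "other": (stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
}


def effective_class(file_uid, file_gid, user_uid, user_gids):
    """Pick the single most specific class for this user (hypothetical helper)."""
    if user_uid == file_uid:
        return "owner"
    if file_gid in user_gids:
        return "group"
    return "other"


def allowed(mode, file_uid, file_gid, user_uid, user_gids, want):
    """Check the requested operations ('r', 'w', 'x') against one class only.

    Unlike the additive AppleShare scheme, bits granted to a less specific
    class are never consulted once a more specific class matches.
    """
    cls = effective_class(file_uid, file_gid, user_uid, user_gids)
    r, w, x = CLASS_BITS[cls]
    needed = {"r": r, "w": w, "x": x}
    return all(mode & needed[op] for op in want)


# Mode 0o047: owner has nothing, group has read, everyone else has rwx.
mode = 0o047
print(allowed(mode, 501, 20, 501, {20}, "r"))    # owner: False
print(allowed(mode, 501, 20, 502, {20}, "r"))    # group member: True
print(allowed(mode, 501, 20, 503, {12}, "rwx"))  # everyone else: True
```

The odd-looking mode in the example is chosen to make the point: the owner is denied read access even though both the group and everyone else have it, because only the owner's bits are examined for the owner.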
So for each group there is read, write, and execute. Write translates very directly to Make Changes in AFP. Read is the right to read a file or list the contents of a directory; it's a bit like See Files, See Folders. Execute is where it gets a little weird. Execute applies most directly to files: obviously, if it's an executable and you have execute permission, then you can execute it. It also applies to directories, as we'll see, in a way which is kind of a subtle case.

For files, if you set the set-UID bit, then when you execute that binary, and it only makes sense on executables, the program will run with the ID of the owner of that file. So you'll commonly hear of set-UID root binaries; those are files placed on the system that will run as a privileged user when they're executed. And there's also set-GID, which is used less often, which runs with the group that is associated with the object.

Now, in directories, it's probably easier to make sense of these permissions if you think of directories as files listing the contents of the directory, because that's how they originally came about. Read is explicitly the permission to enumerate the contents of the directory, to type ls and list the contents. Write is the right to make changes; it's very analogous to AFP's Make Changes. This is making changes to the directory, so these are operations that would require changes in the directory file: creating a new file, renaming a file that's in there, deleting a file, that sort of thing. You can set execute on a directory, and that limits access a little bit without read. Normally you have read and execute together. If you give execute without read, you retain the ability to open files in there, provided you have permissions on the file itself, but you lose the ability to list the directory contents. So it's almost sort of security by obscurity: if you know what the file name is, or your program has built in some file that it's referencing, execute is enough to get it open, but you
need read in order to enumerate the contents and look at it. So that's sort of an edge case that you may run into, and it's unusual. Finally, setting the sticky bit, one of the special bits that is associated with an object, means you can give write, but the ability to actually make the changes is limited to the owner of the object itself. So it imposes one additional test before you can actually make use of the write permission that you would otherwise look to be granted.

Now, there is a special group of flags associated with every object as well, and these exist only in one form. There's not a separate group of flags for owner, group, and others; there's only one set of flags. The most common one you'll see is the immutable flag, which takes the place of the lock bit that HFS always had, and that is in fact what SetFLock and RstFLock in the Carbon interfaces use to lock or unlock a particular file. You can see these flags if you use the -o option in ls, if you find yourself in the shell; it will list uchg if the immutable bit is set, so that's a quick way you can tell if things are locked. There's a chflags command that changes the flags, and there's a chflags system call that will manipulate them at the BSD level. All these things are accessible at the BSD level. There's nothing in Carbon or Cocoa that is special above and beyond this: everything is enforced at the BSD level, and everything is accessible at the BSD level.

The other gotcha you may run into is that when something has been marked immutable, it can't be moved. That used to not be true on Mac OS: you could lock a file and still take it and move it somewhere. On servers that was actually sort of an awkward thing, because you could lock something down and somebody who had Make Changes could make it disappear from you. That's no longer the case. When something is immutable, it doesn't go anywhere and it doesn't change. Now, these flags, the immutable and
the append flags, have special variants that can be set only if you are an especially privileged user, and they can't be unset in the normal running of the system. So if you are trying to protect some particularly important file, you can set a special system-only immutable bit that is stronger even than the regular immutable bit, in that you can't turn it off. So be careful if you try this on your machine at home: you have to take the system down to single-user mode before you can clear that bit.

Now, all the UNIX aficionados, wake up; this is the part where things get different again. There is some special handling of permissions for HFS+ volumes. We had a problem in that we wanted people to be able to take disks and move them around from system to system and retain the same ease of use that they had in Mac OS 9. You could take a Zip disk from your system over to somebody else's system, and you wouldn't suddenly find that the permissions were all wacky just because the numbers that were assigned for the user ID and group ID on your system made no sense on the other system. So the system very carefully uses the permissions only on those disks that it knows are local or for which they were specifically requested. By default, if you have an HFS+ disk that the system has never seen, and you connect it either by plugging in a Zip disk or by plugging in a FireWire drive, even, the permissions will fall back to a scheme where the owner and the group are ignored. You can get that same behavior on request for any disk in the system through the Finder's Ignore Permissions bit; I'll talk about that.

So every disk is identified not by name but by a special 64-bit identifier that we write on there. When a disk is being mounted, the HFS code checks to see if this ID is that of a disk that it has seen before and for which permissions should be enabled, and if it finds that, then it will enable the permissions
and it will be used just like you would see on UFS: you'll see owners, you'll see groups, you'll see everything. If there is no entry for that disk, if it's completely unknown, or if the entry in there says the user asked that the permissions not be used, then the handling switches over to ignore all the user and group IDs on there, make them unknown, and replace the owner with the logged-in user. And that's done completely dynamically: if you have such a disk connected and you log out and somebody else logs in, they are now the owner of all the objects on that disk. So it's not a static mapping; whoever is currently logged in owns all of it. It's a very convenient way to not trip over user or group settings that make no sense on your system. The Ignore Permissions checkbox in the Finder lets you elect to get that same sort of foreign-disk behavior, and it's the same underlying mechanism. What the ignore-permissions bit does is basically turn off the recognition of that disk in the system, and it will be treated without regard for the users and groups. It's called ignore permissions, but really the best way to think of it is as ignore ownership. We'll take questions on this later. I wanted to bring up a few points about performance in the system, and in particular the different ways that you can do I/O. There are a few general words that we'll touch on in a moment, but mostly we want to cover the differences between doing buffered file system I/O, doing direct memory-mapped I/O, and doing unbuffered file system I/O, and the implications of those things. I'll say a few words about zero fill, which your application may have run into and which is something that you see on Mac OS X that you never saw on Mac OS 9. In general, and this shouldn't be news to anybody, the fewer I/Os you do, the better: the more you can aggregate your I/Os into a few large operations, the faster things will go. Even if
you're doing small transfers, the system will try to aggregate these on your behalf. If you're sequentially reading through a file, the system will pick up on that and it will read larger and larger chunks, even ahead of where you currently are. And as you're writing, it will save up writes to do single large writes out to the disk to maximize the efficiency of your I/O. That is why sequential operations are so much better than random operations: even if your application is only ever asking for 4K at a time, you'll be doing very large transfers to the disk. The zero fill that we'll cover in a moment is triggered when you are leaving areas of the file unwritten but you do become the owner of them, and it's best to avoid that because it's really just wasted effort. You might as well write the data sequentially; don't skip ahead of the end of file, for instance. So this is basic buffered file system I/O. You see the device driver in the system, which is doing the actual data transfer, and the buffer cache and the virtual memory system, which are the parts of the kernel that govern all the data in the system. In Mac OS X those are actually integrated and they coordinate with each other, so where a particular piece of a file and a page in the system are the same, the buffer cache and the virtual memory system coordinate access to that page, and there's only ever one copy of the page. That means if you have something mapped and you write to it, you'll see those changes in the mapping right away, and vice versa: if something gets paged out, the read/write path will see that right away. It seems obvious in hindsight, but it's an awful lot of work to make that work correctly. And finally there's a user application drawn there, and you see the user application with a page of memory in there. That's really the appearance of a page that is managed on the user's behalf in the virtual memory system. You can think of it either way: you can think of the page as
owned by the user, or you can think of the page as managed by VM on the user's behalf; it's really all the same thing. So in basic buffered I/O the data is first copied from the device driver into a buffer that is set aside by the file system to hold blocks for that vnode, and those blocks are shared between VM and the buffer cache as necessary. So that's the first copy. Then the file system will copy whatever the user asked for, either read or written, into the user application, and there's a second copy made into the user pages that hold the user buffer. So it's completely flexible: you can read at any offset in the file and you can read any amount of data, but you do end up making two copies, first a large page-aligned copy for the convenience of the system into the buffer cache, and then a separate copy from there into your user address space, where the data really lives. And by the way, the user page ends up being dirtied as a result, and we'll see in a moment why that's important. But if you're going to be reading a file over and over, or you're reading back and forth through a file, it's a cost that's well worth paying. In addition to the flexibility you gain from the ability to align the data or the size anywhere you want, the fact that the copy remains in the buffer cache means the next time you hit either that same page or somewhere right around there, you'll probably find it in memory, and you'll only end up doing the last copy from the buffer cache into the user page. So there's an extra copy, but it may be worth it under some circumstances, and this is probably what most of you are already doing: this is ordinary open, close, read, write I/O. Now, instead of that you can do memory-mapped I/O, and it's something that you should consider as an option when you're just reading files. It's a very efficient way to get data in, and there are some advantages, although it requires a BSD VM call to set it up, so it may be tricky to do from a CFM Carbon application. It's a very
nice way to get the data in, because you only end up doing a single copy. Essentially it goes straight from the device driver into the VM system, and from there it's visible to your user application, so there's only one copy made and we save a copy. In addition, the VM page is not marked dirty: all that's ever happened to that page is that something was read into it, unlike the user page that was copied into a moment ago. So when the system needs more pages, it doesn't have to go copying that user page out to swap storage. It's all set; it can just throw the page away and read it in later if it should get page-faulted in again. There are some disadvantages. Every transfer is at least a whole page's worth, so if you've got a file with ten bytes of data in it, it's obviously not worth mapping. But it's a nice way to read through a lot of data. The VM system does the same clustering of I/O operations that I mentioned earlier: as you're touching through pages, it will start paging in more and more in advance. It's not good for writing, because you can't extend the file by mapping it, but it's a very good way to read in a sizable data file that you're only going to read and that you're reading sequentially, and it'll save you a copy. Now finally, there's something that's almost a mixture of the two: you can choose to do unbuffered I/O. It's actually very easy from Carbon, because it's just the ioPosMode noCache bit that you can set on a read or write transfer in Carbon, and Carbon does the work with the system on your behalf. If you're not using Carbon, you can make fcntl calls to enable this mode for your I/O. It basically skips the intervening buffer cache altogether, and the data travels directly from the device driver into your buffer. Now, that imposes on your application some of the limitations that the file system previously took on on your behalf, so the data has to be page
aligned and it has to be a multiple of pages. But if you're, you know, reading in a QuickTime movie that you're going to play once and never necessarily touch again, it would be a waste to fill up the buffer cache with all those pages, and it's perfect to just read that in directly. That way you have total control over the amount that is transferred in a single transfer, so if there's something about your data that you know and that the system wouldn't know, this is a very good way to do it. If you need to grab a whole frame of data, or for some reason you know that 64K is exactly the right size to transfer and you don't care until you have a full 64K available, this is the kind of I/O you should consider. Unlike with memory mapping, the page is dirtied just as it would be with ordinary I/O: the page in user space is marked dirty because it's been copied into, and it will be swapped out if necessary. But it's a good way to do I/O and not fill up the buffer cache if you're not likely to read the data again. And you can write files this way, so if you're writing an output file that your application isn't just about to reread, this may be a good way to do your I/O. Now, the zero fill I mentioned. The Mac OS X kernel tries to be very careful not to let you read data that you haven't previously written. If you recall cases where some major word processing application would inadvertently ship pieces of your hard disk out along with your documents, you'll see why this is a really nice feature. You have to be careful, though, because if you have a file that you're writing randomly and the first transfer is some distance into the file, you'll end up zero-filling the whole intervening space. Basically, anything that you can potentially read, either you should write or the file system will write with zeros on your behalf as part of the write transfer. So those cases are basically where you SetEOF to make the file larger, or where
you do a write that skips ahead past the end of file some distance and starts a transfer there, creating a gap. That gap will be zero-filled. So for those reasons, sequentially writing a file, aside from all the benefits of clustering the I/O that I mentioned earlier, is far preferable. Now, a word about the cost of caching. You should be careful when you decide to cache data in your application, because in Mac OS X you are constantly running with virtual memory enabled, and what you think of as setting aside some memory for this particular cache is really just that much more paged memory. In fact, you may end up doing a number of I/O operations just to read the data, and you may have to page out some other dirty page in the system to free up a page for your cache. You end up incurring the cost of the actual transfer to read the page in, and if this data turns out not to be referenced, you may end up having to page out a page that you dirtied for this cache. In addition, you have to be very careful about how you structure the cache. This is not wired memory that's sitting there on your behalf. If you have a cache data structure that is laid out very conveniently in memory but ends up skipping around from this page to that sort of randomly, you end up touching all these pages, and you may end up doing page-ins with every new page that you touch. So be very careful to structure your cache in a way that minimizes the number of potential page hits needed to get you to your data. Altogether, it's very easy for an application cache to become much more expensive than simply reading the data right back in from disk, especially if the data is something that is mapped directly into memory, for instance. So think about it carefully, and only use caches for things that are truly hard to reconstruct or where you are sure that the hit rate is actually very, very high. So finally, Mac OS X is a good time to rethink some of the fundamental assumptions behind your application. Think about the kind of data
that you're reading and the pattern that you're reading or writing the data in, and think about what mechanism you might best use to get that I/O in and out of the system. Look at your application as it's running and figure out where the real bottlenecks are before you decide where to spend your time and effort and trickiness and what to optimize. If the bulk of your application is reading and writing files, it's obviously worth thinking about. If the bulk of the time is spent waiting for the user to click on some cell somewhere, it may not be an issue at all. Look at the underlying assumptions that went into your application, because some of them may well have changed in Mac OS X. Some system calls that used to be almost free on Mac OS 9, because they came straight out of memory all the time, may be reasonably expensive on Mac OS X all of a sudden. And again, that's the reason to go back and look at your application in action and see where the time is being spent, because you may be surprised to find that you're spending a lot of time doing things that you assumed would be almost free. And finally, try to avoid making assumptions about how fast something will be to read, because you might be surprised: the file may actually be somewhere remote, over on a network in somebody's home directory, and reading the preference file you thought was cheap actually turns out to be a very lengthy operation that might involve automatically mounting some volume, getting access to the data, et cetera. So don't make assumptions about what's fast, what's local, and what's remote. It could be on a WebDAV volume for all you know. So finally, there are some tools that you should look at. There are some classic UNIX tools. top is a very nice tool for seeing the size of your application, the amount of virtual memory that it has allocated to it, how much of that is shared, and how much of that is private. It gives you a little peek into the system and will show you how fast paging I/O is being done, how busy the system is, what
it's doing, what in your system is using the most CPU time, all kinds of things. I recommend it highly; you should run it often. There's a time command, which can be very interesting. It's limited to command-line things, but it will tell you how much system time and how much user time was spent executing this particular application, along with the number of I/Os that were done on behalf of your application. So you can easily tell when your application suddenly starts doing fewer reads or fewer I/O transfers, or more larger ones or smaller ones, and the percentage of system time versus user time is interesting. If the application is spending most of its time in system time, you should think about what system calls it's making to cause that to happen, and similarly, if most of the time is spent in the system, don't worry too much about your application's algorithms, because they may not be as relevant. So time can be interesting. sample is another long-standing NeXTSTEP tool. It dynamically probes your running application and takes a peek at where it is currently running and the stack at that time. You can tell it to take a number of samples over a certain period of time, and it will tell you what percentage of time was spent in what routines. That may tell you where the hot spots in your application are, and tell you whether your application is constantly waiting for I/O to come off disk or waiting for the user to do something, all sorts of things. So sample is interesting. And finally, fs_usage, which you may have seen demoed in other sessions as well, is a wonderful tool for getting down to the real nitty-gritty of exactly what your application is doing and what the system is doing on behalf of your application. You may be making Carbon calls and be unaware of the number of system calls that go on under the covers to make that Carbon API happen. So at this point I'd like to bring up our resident expert in bad demo code, my manager, Clark Warner. Thank you, Pat. All right, let me
pull down an Explorer here and a BBEdit window. All right, I'm going to bring up a copy of TextEdit, which many of you, I imagine, have probably used by now, and let me bring up a copy of the Process Viewer. No, I have to do this. Are we all right? Okay. First I'm going to do my little UNIX command here to find out the PID of the process that is TextEdit, and it looks like 278. Let me change the font here to make this a little bit more readable for you. That's probably better. Okay, this is not a command we want you to run too much at home: I just made myself root. What's that? Oh, thank you. Okay, I'm now monitoring all the behavior of TextEdit, and when I go back into it (whoops, let me bring it back) you can see that as I click around, various things are happening. One of the things I'll do is open up a file that I put on our demo volume. You can see a lot of things are happening now. There's a 116 demos data file. Okay, so here's my opening of the data file. You'll notice there were a few page-ins, some opens, some fstats, some reads, but basically one read call of a fairly large size, so that's not too bad. I'm going to close up this file here. Let me just show you (oops, here we go again, sorry, you'd think I would have woken up by now) the man page for fs_usage. One of the most interesting things about the fs_usage program is the ability to see all of the actual Carbon file system calls that are happening while the read calls are happening. If you notice here, there's this /tmp/filetracing: if you create in /tmp this file called filetracing, then you will actually see all the Carbon calls as well as all of the BSD calls that are coming through. To show you that briefly, I'll turn that on, and now I'm going to launch an application that I call DumbText. DumbText is the standard SimpleText text editor hacked up to do one-byte file reads, just to give you an idea: if you guys wrote apps this way, this is how you could figure it out before your boss does. Let me talk to you a little bit about
some of the key issues in building your own file system. One is that it's really not recommended. The reason we say that is not because there's some fundamental thing wrong with the APIs or anything like that. The basic issue is that kernel extensions, like extensions in Mac OS 9, have the ability to create kernel instability, and while I know all of you write perfect apps, that's not true for the people that don't come here, and so we want you to bring the word back to them: don't try to throw stuff into our kernel unless you absolutely, positively have to, and if you do, you're going to have to contact us. Building a file system extension requires deep internal knowledge of our kernel, not just the VFS stack but also the various calls you might make to use kernel services in your file system, and they change. As we change internals in the kernel, there are things in your file system that may have to change as well, and so basically, if you write a file system extension, you're rev-locked to our kernel. If you went to Dean Reece's talk, you'll note he talked about kernel versioning. If you don't version your kernel extension correctly, in the future we won't load your file system extension. If you don't version it at all, in the future we won't load your file system extension. So not only is there the implicit rev lock, because kernel interfaces are changing internally and you may be using those interfaces, but there's the explicit rev lock: if we know we've changed those kernel interfaces, we will change the version of the kernel, or, what's more, we will change the version of the kernel that that kernel is compatible with, and your file system may not load. I want to give you one of these changes that was made recently inside of our kernel, something that happened between Public Beta and Mac OS X GM, to give you an idea of what kinds of things we're doing. We wanted Mac OS X to be a fully preemptable system. The Mach microkernel was already fully
preemptable, but the BSD kernel was not. Historically in BSD, when you make a system call you run all the way until you reach a voluntary yield point, that is to say you do I/O, you try to acquire a lock, you allocate memory, and so forth, or you run all the way to completion. So we invented a mechanism called funnels to wrap all the BSD code, so that that assumption of non-preemption would hold inside the code but BSD system calls could otherwise be preempted. Funnels are acquired when a thread enters a system call, and they're released when the thread returns to user mode. They're also released when a system call reaches a voluntary yield point like I/O, allocating memory, and so forth, but they're held across kernel preemption. So a BSD system call can now be preempted in the kernel and something else can run, a user thread or a Mach thread or an I/O Kit thread and so forth, but the BSD structures won't change out from underneath the BSD system call, so it's happy. We also split the funnel after we developed the first one, so that networking operations in the kernel are now handled in the network funnel and all other operations, including file system operations, are handled in the kernel funnel. We found that we actually could separate network activity in the kernel from file system activity in the kernel. What this means, though, is that if you are writing a network file system, say, every time you went to use the networking infrastructure in the kernel you'd have to switch your funnel from the kernel funnel to the network funnel, and when you went back you'd have to switch back. Switching funnels is a blocking call, so the entire world can change out from under you when you switch from the network funnel to the kernel funnel and vice versa. These are all things you would have to know. The network funnel is for things like socket I/O and bind and accept calls and so forth, and the kernel funnel is for everything else. And there are some calls, of course, that can be called from either funnel
or from no funnel: memory allocation and free, et cetera. So here are some need-to-knows if you want to build a kernel extension for Mac OS X. One, as we mentioned last year, we built this thing we call the unified buffer cache; if you had built a file system prior to that, it would have had to change to support it. Likewise, between Public Beta and now we introduced the split funnel. And of course we're going to be doing things to improve the performance of our kernel and the functionality of our kernel on into the future, and some of those things are going to require changes in the file system. If you have one written, you're going to have to be in the loop; you're going to have to contact Apple. There's other stuff that may be involved, but we can only tell you so much in an hour. So the primary message is: talk to Jason. If you're thinking about building a file system, contact Apple; you're going to have to be in the loop. Now, we do a little demonstration here, because we like to bring concepts home at the file system session. So I am a rogue kernel file system extension, and my compatriots here, Pat Dirks, Scott Roberts, and Umesh Vaishampayan, are the kernel, and this is me, an inappropriately versioned kernel file system extension, attempting to load. [Laughter] [Applause] I think you get the picture. Here are some other sessions you may be interested in at the show to help you with building applications that are file-system-centric, or even building file system extensions. Open Source at Apple is happening at 10:30, right after this session, in Hall A2. There's a session on AFP Server and the AppleShare client file system in Mac OS X; that's happening tomorrow in this room at 3:30. There's a Carbon performance tuning session happening in Hall 2 tomorrow at 2 o'clock, and an Apple performance tools session happening in Room A2 Thursday at 5:00, where you may get to look at your third demo of fs_usage. We think fs_usage is so important that if you come to the Worldwide
Developers Conference you should see it at least twice, possibly three times. Likewise, Leveraging BSD Services will happen in the Civic Auditorium Friday at 2 o'clock. The Darwin kernel presentation, which will give you an idea of how the kernel is structured internally, the Mach kernel and some of the BSD kernel services outside of file systems and networking, is going to be at the Civic Center at 3:30, and the Darwin feedback forum will be Friday at 5 o'clock. That's all we have for you today. I'm going to ask Jason now to come up, and he's going to moderate our question-and-answer. I'm going to bring Pat Dirks, Scott Roberts, Umesh Vaishampayan, and Don Brady up on stage from the file systems and kernel team, and we'll take your questions.
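The buffered and memory-mapped read paths and the zero-fill behavior described in the performance part of this session can be tried out with a small script. What follows is a minimal sketch in Python, not code shown in the session: the helper names (`write_with_gap`, `read_buffered`, `read_mapped`, `demo`) and the 8192-byte offset are assumptions chosen for illustration. It writes a few bytes some distance past the start of an empty file, then reads the file back once through ordinary buffered reads and once through a read-only mapping. On a POSIX-style system the skipped region should read back as zeros either way.

```python
import mmap
import os
import tempfile


def write_with_gap(path, offset, payload):
    """Write payload at offset in a fresh file, skipping past end of file."""
    with open(path, "wb") as f:
        f.seek(offset)      # move beyond the current (empty) end of file
        f.write(payload)    # the skipped region must now read as zeros


def read_buffered(path):
    """Ordinary open/read/close I/O: data is copied into our own buffer."""
    with open(path, "rb") as f:
        return f.read()


def read_mapped(path):
    """Read-only memory-mapped I/O: pages are faulted in from the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]     # copy out so the result outlives the mapping


def demo(offset=8192, payload=b"data"):
    """Return the file contents as seen through both read paths."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_with_gap(path, offset, payload)
        return read_buffered(path), read_mapped(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    buffered, mapped = demo()
    print(len(buffered), buffered == mapped, buffered[:8192] == bytes(8192))
```

The sketch only shows that both paths see the same bytes and that the gap is zero-filled. It does not measure the copy or paging costs discussed in the session, and it does not cover unbuffered I/O, which needs platform-specific calls.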