---
title: WWDC2000 Session 104
framework: wwdc
role: article
path: wwdc/wwdc2000-104
---

# WWDC2000 Session 104

## Transcript

Ladies and gentlemen, please welcome... wait. Okay, now, as I was saying, please welcome the manager of Mac OS X file systems engineering, Clark Warner.

Thank you. The stage police are watching me, and I've heard from my spies that when I turn around to jump on the stage, something bad is going to happen to me, so I have to be careful. I've noticed that the stage is higher this year than last, so I think I'm ready. Thank you, and welcome.

We have, we hope, some really good information for you folks, whether you're just a user of Mac OS X, a developer on top of Mac OS X, or someone who actually wants to implement a file system or some kernel extension of the file system. We hope that by the end of the session you'll have some vague ideas of what you need to do, and whether or not you need to do some of these things at all. We've got information on the interfaces, some information on the differences between Mac OS X and Mac OS 9 that come from our BSD heritage, and a little description of some of the differences between the volume formats that we support in Mac OS X. We also have a relatively significant architectural change to tell you about in Mac OS X involving our buffer cache, so we're going to spend some time on that at the end of the session, hopefully enough to give you a feel for what we've done and why we've done it.

For those folks who have been to a file system session before, you're going to see a fair amount of review, and we think that's needed for a couple of reasons. One is that, as Steve said in the keynote on Monday, Mac OS X is becoming very real, and we expect a lot more folks who have not been introduced to any of the technology in Mac OS X, who are going to need to hear some of the stuff we've talked about in previous sessions. Also, given that it's much more real now, I imagine there may be a few of you who perhaps weren't as concerned about
what you were hearing in years past and will want to hear some of that stuff over again. That having been said, there will be some new stuff.

Let me start by giving you a brief overview of the core OS. If you went to the Darwin overview session, then you've seen this picture already. In fact, my tech lead has supplied me with this really handy laser pointer, so I can point out the coolest part of the Darwin system, and it goes like this. We want to tell you a little bit about the internal architecture of the file system as well, and this diagram is something you're going to want to keep in your head as we move along in the presentation. There are several key things I need you to remember. The application environments, Classic, Carbon, and Cocoa, all have their own way of interfacing with the file system, and we're going to talk about those interfaces a little later in the presentation. The actual stuff that my group works on is here, inside of Darwin, inside of the BSD kernel, and we have two interfaces of particular significance, which we'll talk about in some detail today, as well as the collection of volume formats that we support.

Our basic interface is a BSD UNIX interface, but BSD traditionally doesn't support in its volume formats some of the sophisticated features that have been available in Mac OS in the past, namely things like resource forks, catalog information, and that sort of stuff. So we've extended the BSD interfaces to allow for retrieving and setting that sort of information. Likewise, inside of BSD there is a relatively clean abstraction layer called the virtual file system layer, which separates the file-system-dependent layer from the file-system-independent layer inside of the BSD kernel. If you wanted to develop a file system yourself because you had a new device type, or if you wanted to develop a file system plug-in which will intercept all the file system calls and then
deliver information to file systems beneath, this is the interface that you would actually write to. We're not going to talk in great detail about the volume formats, but suffice it to say that we support a number of them in Mac OS X. We've got support for the BSD Fast File System, which is our implementation of the UNIX file system (UFS); HFS; HFS Plus; the Network File System (NFS); AFP, by virtue of the AppleShare server team, who have done an AFP client implementation; and more to come.

So here's how we're going to do it; here's our session outline. We're going to do a little status update to let you know what's changed in Mac OS X. We're going to talk about some of the new features of the system since we last got together at last year's Worldwide Developers Conference, focusing on some of the stuff we've done for DP4 and DP3. As I mentioned, we're going to talk about some of the interfaces, the various ways of getting at file system functionality in Mac OS X, and about some of those differences I mentioned between different volume formats, and those arising out of the fact that we have a BSD system and a Mac OS system coexisting. We'll do a review of the virtual file system interfaces themselves, to give you an idea of what they look like and how you might make use of them. And we're going to talk in some detail about the changes in our buffer caching architecture, so that if you're one of those Darwin folks building a new file system, you'll have an idea of what you'll need to change to work with our new unified buffer cache. The next new feature could be yours: we're going to talk a little bit about Darwin, and maybe some ideas about what could happen next.

So here's our status update. At this time last year we demonstrated to you, for the first time ever in public, a UNIX-based system, in that case Mac OS X, booted off of an HFS+ volume. That was the first
time ever in public, and this year we're happy to report that this is just a standard, humdrum, boring old part of the system. Now in DP4, and even back in DP3, the primary volume format is HFS+. The server team has added support for the Apple Filing Protocol, the AppleShare file sharing protocol, as I've mentioned, and it's actually in DP4: if you look in your demos folder, you'll find an application called Test AFP Client, and you can use that application to mount an AppleShare IP-based volume on your Mac OS X system. It's still a little early and a little new, and you may run into some bugs, but it does work; I have used it.

We are also well underway with our Universal Disk Format implementation. It's not in the release yet. It turns out that UDF is very much a kitchen sink of file systems: it has just about every feature any file system person could ever imagine, so it's quite a challenge to build support for it into the system, but we're well underway with that.

We have also added support for large files. It's not exposed yet for your use because, as a result of the new caching architecture, we're actually relying on the virtual memory system for I/O in some cases, and the VM interfaces are still 32-bit. But all the file system interfaces are now 64-bit, so we will be able to support very large files in Mac OS X, and the virtual memory system is going to change to accommodate that.

Also important to note is that all of the interfaces into the core file system at the BSD level use UTF-8 strings. The important thing about UTF-8 is that it is a single-byte character encoding of Unicode characters. Unicode, for those who don't know, is an international standard that allows for just about every character in any language that's in popular use today; up to 65,000 character spaces are reserved in the Unicode set, generally two bytes each. But a single-byte encoding of Unicode was invented, and it's called UTF-8. The key feature of UTF-8 is that you can pass it through normal C string
interfaces: it guarantees that there won't be any null termination where there isn't supposed to be. And the first hundred and twenty-eight bits, or rather the first 128 characters, in UTF-8 are just regular ASCII characters, so for people who are just dealing with a regular ASCII system, it's going to look the same to you, even though we've dictated throughout the kernel that all file names are going to be UTF-8. It's not until you start seeing Shift-JIS and so forth, kanji representations, that the byte stream will start to look different. But the beauty is, again, that you can pass it through char * interfaces.

We also have a number of new features besides what we've given you in the status update, and we're going to talk about those in more detail; actually, I'll have my tech lead talk about those in more detail as we move along.

Perhaps I should apologize a little bit: our radar screen hasn't changed much. If you don't remember, our definition of the radar screen is things that I think we're going to have at some point, but I'm not sure, and I'm unwilling to tell you. So you can ask the question, but that's the answer you'll get. But we do recognize that a lot of people are really interested in a directory change notification feature, and quite frankly we had sort of hoped to have it by now, but there were other things that took priority. It is, however, something we're still considering.

Another thing we'd like to do is allow NFS access to HFS+ volumes. It turns out this is actually quite a bit trickier than it would seem on the surface. HFS Plus, after all, has pretty much all the features now that UFS, the Berkeley Fast File System, has, so it seems like it would be trivial to do the same things that UFS does to allow NFS access. But the problem is that there are a number of architectural issues with Mac OS X overall that make this rather tricky. In a nutshell, the Carbon application environment, when it runs across a volume that doesn't support some of the features
that are associated with HFS+, like resource forks, catalog information, access by ID, and so forth, well, it mimics that for you. It actually builds its own environment for creating its own separate resource fork and its own separate catalog information, and does its own ID-to-pathname mapping, so that as an application you don't have to care if you're on a UFS volume or if you're talking over NFS. You can make the same Carbon calls you've always made, and they'll just work. The problem with that is that on HFS Plus we have local support for all of those features. So if you were to export an HFS+ volume over NFS, locally the clients would use the native resource fork and catalog information. Remotely, because the NFS protocol doesn't have support for this sort of thing, Carbon is going to see that it's NFS and start mimicking it, and so the remote clients will have a different view of the resource fork data and of the catalog information than the local clients do. That's actually quite a problem for us, and we're still figuring out exactly how to solve it. In the short term, we'll probably have some configurations where you can NFS-export HFS+ volumes even though those problems aren't solved, for people who just want to use it in a server configuration, aren't so concerned about how the local access works, and are able to step around the issues.

We would also like to add some more file system support, specifically the Joliet extensions to ISO. At some point we'd like to have an NTFS implementation and, even more importantly, an implementation that will allow us to read files from DOS and Windows, so FAT16 or FAT32, the Windows-format file systems.

So now we actually do have a demo, and I'm going to call Conrad Minshall to bring it to you. I'm always a little bit jealous of the QuickTime guys, because they get to do these big fancy multimedia demos, and they have movies, and that's really exciting. In file systems we never get to do that stuff, and this
won't let us do that stuff, so we're going to sort of skirt the edge a little bit and try to get in at least a picture for you. With that, I'm going to introduce Conrad Minshall, who is one of the file system engineers on my team. [Applause]

So, this actually got working about 20 minutes ago, and hopefully it's still working. There's a DVD-RAM disk installed in this system down below, and it's already mounted; it's mounted on /data. Here I can see I have one movie already there.

So, last year at this time we gave you the first-ever public demonstration of booting off of HFS Plus. You can imagine what might happen next year, when you all pay your money again and come back to see us at the Worldwide Developers Conference in 2001. With that, I'd like to bring up Pat Dirks, my tech lead, and he's going to walk you through some of the cool and awesome new features that we've added to the file system in Mac OS X.

Thanks, Clark. I'll do without that. It's pretty amazing to think that last year around this time we just very carefully demoed the first booting and rooting off an HFS+ file system. It's been quite a year. Let me take a moment and talk about some of the new features that we've added.

Prime amongst them, probably, once you realize that you are running on HFS+ file systems, is that Mac OS X is a multi-user system, and at its core is a BSD base, so every file and every directory in the system is going to have permissions assigned to it. Those permissions are a bit like what you may have seen on AFP file servers, but not quite. There's a similar division between owners of files and directories and a sort of everyone or guest-access class of people, and then there's a group that you can assign to every object, and you can assign separate read, write, or execute permissions to each of these classes. Underneath, although you'll see names in most of the interfaces, these are mapped to numeric identifiers that are mapped out of a database, and what
access you have to any given file or directory is determined by the user ID of the running process. That's all fine and well if you're running on a single big UNIX system where things get mounted and stay mounted all the time, but it's a different story if you're running in a Mac OS type of environment where you've got Zip disks that you carry around and pop in and out of machines all the time. There you have a problem: the numeric IDs that are assigned on a disk that may have been created on your neighbor's system may not mean anything on your own system, or may map to entirely different names. So simply plugging in a disk and using the permissions on there is not sufficient, even though HFS+ has provisions for storing the user ID, group ID, and permissions information. You have to worry about what to do when there are permissions on there that make no sense.

We wanted to make that as transparent as possible, and to that end we decided to tag every volume with a 64-bit ID that we make up the first time that volume is initialized. The system maintains a list of IDs of volumes that it knows about, and it only uses the privileges on those disks whose ID is in the list. If the ID is not in the list, we fall back to a mode where we ignore all the privileges on the volume and make the person who's logged in on the machine the owner of all the files and all the directories on that volume. The only thing to be aware of is that if a disk is used in that mode and you create new things on there, those will be tagged with 'unknown', and when that disk comes back to the originating system, where the permissions are used, they'll see 'unknown' on there and they'll get the same treatment: the person logged in on that machine will be the full owner. That's a little subtlety to be aware of, but basically what we strove to do was to approach as much as possible the regular Mac OS experience of being able to take a disk,
put it in your machine, and use it in any way you see fit, and we don't want permissions to get in the way of that.

Something else we did is an extension of an API that's already provided for AFP URLs, allowing you to manipulate server references as URLs and have the system mount those for you. Again, this is currently supported only for AFP, but the idea is that we'll extend it over time. In effect, it is the implementation of the Connect To dialog in the Finder, where you just type in a URL and the system takes that URL, figures out what file system uses it, finds a place in the file system to make it appear, creates that directory, mounts it, makes the server connection, and makes it available to you. So it's a very nice API. We're working on it, and it's still evolving. If you think you have a URL scheme that would make sense to tie into that, come talk to us afterwards; we're still discussing whether we want to make this dynamically pluggable or a small static list. But that's how we envision providing access to URLs over time through the file system.

We've added a number of features to the system to help support the HFS Plus APIs, and long file names are prime amongst them. On Mac OS, using the Unicode APIs, you can now create and manipulate files with names of up to 255 Unicode characters, and we've extended the BSD APIs to do the same thing. As Clark mentioned, UTF-8 is now the standard interface for all file and directory names in the file system, so we allow you to pass a UTF-8 encoding of a string of up to 255 Unicode characters into the kernel that way. Everything is done through UTF-8 in the kernel. And last but not least, we've supplied new system calls that allow us to support the bulk enumeration calls, and we'll get into some more detail about that in a moment.

I want to talk both a little bit in review and a little bit to bring you up to date on the interfaces, and talk about the differences between the behavior of possibly the
same code on Mac OS X and Mac OS. When you're writing your application, you've basically got your choice of one of three environments, and they all make sense at different times. They're all layered on top of the common BSD layer, and as Clark mentioned, that's where we really do our work. Inside BSD there's a virtual file system, or VFS, layer that all file system calls get translated to, and that takes care of calling the file-system-specific implementation of some particular feature or operation.

Carbon is the natural way to port your application: if you have a Mac OS application, it should be a cinch to carbonize it, and the Carbon interfaces are in most ways exactly like the existing Mac OS interfaces. If you're writing a brand-new application, you may want to consider writing to the Cocoa APIs. They're a completely different, object-oriented set of APIs. They were designed from the outset to be very portable to a number of different environments, so they don't support some of the features that you may expect on HFS or on Mac OS: there are no resource forks, there's no support for Finder info, those sorts of things. But you do have a very uniform API to general file storage. Finally, of course, there's Classic. If you do nothing to your application, it will run in Classic, and obviously all the calls are supported there too. As I said, all of those are implemented on top of a BSD layer that we've extended with an actually surprisingly small number of new system calls.

I guess the first thing to be aware of is that when Mac OS 9 came out, we introduced a number of new APIs to let you operate on files and directories with Unicode and with FSRefs, a new form of file reference, and all those calls are supported in Carbon. So I think the message is clear: think Unicode, think FSRef, and it'll work great in Carbon. The full range of calls is supported. CatSearch is supported; it's not something that's in traditional BSD, but we've added system calls to
support that, and so anything you can do in 9.0 is fully supported in Carbon.

The new GetCatalogInfoBulk call that I referred to is a great call. It's a bit like the AFP Enumerate call, which lets you get directory information on a number of objects in a directory in one single call, and it has a very flexible method of specifying what it is you're interested in about each particular object. You can say, for instance, 'I want to get just the name and the Finder info for all the objects,' and it will return just that to you. You'll see that appear in a number of different calls, and in some of the new system calls as well: there are methods using a bitmap type of specification to let you specify which attributes, which pieces of information about files and directories, you're interested in. It's great because the underlying file systems don't have to go and possibly make up information that you're not even interested in. For the longest time on Mac OS we've carried around this 'number of directories in the root' field, just because it was there in MFS when the Mac was first introduced, and we had no way to tell whether somebody who called GetVolInfo actually needed that field or not. This will help us move away from that.

The BSD extensions getattrlist and setattrlist are new ways of getting and setting information about individual files or directories. They basically let you specify a buffer that your information will be filled into, and a bitmap with a couple of fields, very much like the AFP GetFileDirParms call. Again, you can say, 'I want the Finder info about files, I want the fork lengths, and that's it,' and you'll get exactly that returned. searchfs is a very direct implementation of the CatSearch functionality: all the things you can do with CatSearch you can do in this system, and it's all implemented through searchfs. exchangedata is an implementation of the file exchange functionality that will let you do
safe saves of data. These are all things that we needed to do to support Carbon's functionality, but there were no equivalent BSD calls to do it. Most other things actually map pretty straightforwardly to existing BSD calls: all the various variants of Open and OpenDF and whatnot all just map to the straight BSD open call. For those areas where we didn't have any system calls to support them, we made up these new ones.

fcntl is an existing BSD system call that lets you do various operations on open files, and we added an option to let you do file allocation, for instance; the Allocate call was not something that's known in the BSD system. getdirentriesattr is the system call that was added to do the bulk enumeration, and if you are at all familiar with the attribute list APIs, getattrlist and setattrlist, it will look very familiar. There is an attribute list that is passed, a buffer and a size that you specify, and a few parameters that let you say which entries in the directory you're interested in, how many items you are prepared to take back, and various options that you can pass. With that one call you can get information on a whole number of directory entries all at once, possibly the whole directory in one single call, so it's a very efficient way to enumerate file data.

The VFS layer is something we'll get into in a little more detail later, but as I said, that is the layer inside the kernel that separates the file-system-independent system call layer from the file-system-specific implementation of certain calls. It basically manipulates these abstract data structures called vnodes that are specific to each file system, and there's a dispatch table associated with each vnode that lets the system dispatch to the file-system-specific implementation of some call. If you're writing file systems, you'll get very familiar with that layer of interface.

Now, there are some differences in the way that code works even if the code looks the same or
similar. UNIX, for instance (various flavors of UNIX, and BSD among them), lets you delete open files, which was always a big no-no under Mac OS; you'd get a busy error if you tried to do that. We support both semantics. If you're calling the Delete call in Carbon, you'll call a variant of unlink that will return an error if the file is open. But if you are a BSD or a Cocoa application calling unlink to remove a file from a directory, that will succeed even if the file is open. So it's possible that files you have open will get deleted from the catalog. That's not as wild an adjustment as you might think, because even under Mac OS 9 it was always possible that something would get deleted or renamed out from under you and no longer appear where you thought it appeared. Don may get, in a moment, to the kinds of games we play to make this happen, but it's something to be mindful of.

Hard links are strictly a UNIX-ism. They're a way for two entries in the catalog to refer to the same exact data on disk. This is not an alias that you resolve to get to the data; these are two equivalent names to get to the same data on the disk. So again, that may be something to be mindful of.

One thing about the UNIX file system APIs is that they're very careful not to return you data that you have not yourself written. If you're just streaming out a file, that's no issue: you're writing everything before you'll ever read it. But you should be careful of some things that may have made perfect sense at times on Mac OS, like calling SetEOF to extend the file to its maximum length before you start writing the data. Now, to prepare for the possibility that you might actually read that data, since the logical end of the file has been moved out, that whole intervening gap between the current EOF and where you're setting it is zero-filled. So you may be doing gigantic I/Os where before you were doing absolutely none. Calls like Allocate are fully supported, and they are a much better way of making sure that the data is
available on disk before you make your write call, for instance, than SetEOF is.

One sort of quirk in the BSD APIs is that there is no notion of a create date the way there is on HFS+. HFS+ volumes have them, and we support them, but if somebody uses the BSD APIs to manipulate the modification dates of a file, the create dates are actually untouched. Until we decide to do something about this, you may actually see modification dates that are prior to the creation date of a file. It's a little weird, but again, something to be mindful of.

And then, of course, path separators. UNIX and BSD all use slash to separate path names, and HFS and Mac OS use colon. Most Mac applications have long been discouraged from using full path names, but if you have one that does, you should be aware that the path names you use in your Carbon calls are not the same path names you should use if you make BSD calls. The system underneath actually translates between colon and slash for you, and the results may be surprising. If you call BSD APIs to enumerate a directory that contains files with slashes in their names, they will appear to you with colons inside the name, because we can't pass up a path name through the BSD layer that has a slash in it; we can safely pass one with a colon in it. So between Carbon and BSD there's all kinds of colon and slash translation going on in the system, and if you are spanning these two layers, doing some BSD calls and some Carbon calls, you should be careful about name separators.

Finally, we support a range of different file systems. They all have their own implementation, and they all have their own quirks. On UNIX systems, and BSD is no exception, the UFS file system has long been case-sensitive. We decided not to make a case-sensitive implementation of HFS or HFS+, so HFS is the same case-insensitive but case-preserving file system. But you may run into other file systems unbeknownst to you, because you're just traversing the catalog hierarchy and run into UFS
volumes that are mounted there, which are case-sensitive. So if your application counts on being able to find the same name under two different case representations, you're going to be surprised. That's actually very common in code, where you refer to a header file first by one name and then by another; if you do that on UFS, you may run into trouble.

Sparse files are something that UFS has long supported and that there's been no precedent for on HFS+, although we've had some requests for it on occasion. Basically, sparse files are a way for the file system to store your data files without storing the gaps between the data blocks that you've written. So you can seek out to a location two gigabytes into your file and write 10 bytes of data there, and on UFS that will actually only take about ten bytes on disk. That's not something that's supported on HFS Plus, so if you try that because you're porting a BSD application or something, you may be surprised to find that step one is to allocate two gigabytes of data and then go write your 10 bytes.

Hard links are supported on HFS+ but not on HFS. They're supported on UFS and NFS as well, but again, if this is something you count on, you'd better make sure you know what file system it is you're talking to. So there are a number of subtle gotchas, even though the basic application that you have, if you port it to Carbon and run it on an HFS Plus volume, will run remarkably well unchanged, and you may be unaware of the differences altogether.

At this point I'd like to bring up Don to talk about some of the differences in the way HFS+ volumes are used in the system. Don.

Thank you. OK, I'm going to talk about some of the changes we made to the HFS Plus volume format. In general, almost all of these changes should be transparent to all of you out there, but some people like to hard-code things, and for that reason you're going to have to suffer through all these boring details here. The first set of changes
involve what we do during initialization in Mac OS X for HFS Plus volume formats. Primarily, we made some changes for performance and for some of the features, like hard links, that we added. The allocation block size, which used to vary in size, we've now settled on 4K because of the VM page size. For the B-trees, we upped the catalog node size to 8K and the extents to 4K. One interesting point: when we did the original HFS+ format, we chose 4K and 1K so that people wouldn't hard-code 4K as the node size, but when we went to 8K we discovered that some people had hard-coded it. So don't do that.

The HFS wrapper we added for support with our legacy systems, for booting and for interchange, so that you wouldn't get uninitialized volumes when you mounted them. Going forward, we're using a different strategy for booting off volumes, and so the wrappers are now going to be optional. Having said that, if your strategy for reading the volume name off a partition is to go into the wrapper and look at the MDB, you'll be disappointed, because it won't be there. Like I said, we have the startup file now, and we use that to boot Mac OS X. However, I should note that the startup file is not in the DP4 release; that will be in the follow-on release.

The last mounted version up until now had been set at 8.1. Now that we're using a different code base, we've upped that to 10.0. As Pat mentioned, there are some BSD fields that we now use: the permissions data and the last access date. We now enforce permissions, and we initialize them under Mac OS X, and on every read and write access we set the last access date.

Finder info is mentioned here. It's not really file system data, but people tended to use it in file system calls, so I mention it here. Typically the type and creator and the invisible bit are often used. However, you should be aware
that in Mac OS X they'll often be uninitialized. So if you're doing searches with criteria on type and creator, be aware that they might not be initialized.

One interesting thing we discovered when we were debugging is that occasionally in Mac OS 9 you'll get file names with embedded nulls, mainly because it had a Pascal interface and you could actually do that. It's not a good thing to do, but to support those, and not trip up some of the stacks and the file system layers, we allow them to go through, but you can no longer create them.

In the past we didn't really have any reserved file names, but a long time ago, pre-System 7, we discouraged the use of dot prefixes in file names, mainly because the file system and the Device Manager shared a common open call, and we needed to differentiate between opening a file and a device. That support is no longer there, so we no longer discourage it. In fact, given our BSD underpinnings, we have lots of files that start with a dot, and that's not a bad thing anymore. Of course, you should avoid naming any file dot or dot-dot, because we use those in the BSD layer.

There is some additional pseudo metadata. In the past we had AppleShare and things like the desktop DB; now we have symbolic links, hard links, and deleted files. All of this pseudo metadata should be completely transparent. In Carbon you'll see symlinks as aliases, and in the POSIX and Cocoa layers they'll automatically resolve, so you won't even see them. We just tell you about that so you know about them.

We updated Disk First Aid and fsck to be able to fix hard links and to clean up any of the files that were hidden for delete-on-busy. If you're doing any type of repair utility, we encourage you to add that minimal support to your application as well. Back to you, Clark.

Thank you. Well, if you attended Steve's overview on Monday, you know that there are a million different
ways that QuickTime technology is distributed: on CD-ROMs of famous artists (sorry, audio CDs of famous artists), and he actually also had a cereal box with a CD in it that had QuickTime technology. Well, I have great news. As part of the whole new era of respect for file systems that got us this large hall to get together in, my boss just informed me that we have made a deal with the makers of generic cereal for the distribution of file system technology in the lesser stores near you. [Applause]

All right, let's talk about the virtual file system layer. To begin, this is the layer you would implement to if you wanted to develop a new file system. There are two fundamental uses of it: a new file system for a new device type or a new protocol, or what we call a file system stack. The way file system stacks work is that they implement a set of functions to the VFS interfaces, they sit on top of an existing file system, and at the bottom they call that same VFS interface for a lower file system to process.

So, for example, if you wanted to build an encryption system, you could build a file system stack. You would support the VFS interfaces, and most of your VFS calls would simply call the file system underneath you. But if you wanted to do I/O, then you would look at the buffer, encrypt it if it was coming in, decrypt it if it was coming out. So you would call the file system below you to get the data, make your manipulation, and then return to the caller; or you would get the information from the caller, do your manipulation, and then call the file system below you to finally do the write. That's roughly what the VFS is for.

There are two key data structures you need to know about if you are seriously thinking about implementing a file system or a file system stack. The first is the vnode, which stands for virtual node, and this is the key data structure that represents a file in our system. If you programmed in Mac OS, it's similar to a file control
block, or FCB, but there is one key difference: vnodes are created whenever a file is referenced for any reason, not just when a file is opened. So if you just wanted to get file system statistics, if you did the equivalent of a GetCatInfo, we're going to allocate a vnode inside of our system, initialize it, and it's going to stay around for a while. We have a cache of these in the system, up to thousands of them, that just stay in case you want to reference that file again, so we don't have to recreate it. We also have another key data structure, a mount structure, which is kind of similar to a volume control block. These are the two key data structures that you'll need to work with if you want to implement a file system.

There are also two key segments of the VFS interface. There are what we call the VFS calls (I apologize for the name confusion, but we inherited that), and there are also vnode operations, or VOP calls. The reason they're called this is that if you actually look in our code, in the file-system-independent layer of system call processing, you'll see some work go on and then you'll see a call to something called VFS_something or VOP_something. Those are macros that look in the dispatch table for the right operation to call, given the type of vnode or the type of file system it's about to make the call on. If it's a VFS operation, you'll see VFS_ and the operation, and it will call the VFS routine for that file system. If it's a file-based operation, you'll see VOP_ and the operation (VOP_REMOVE, VOP_OPEN, VOP_CLOSE, and so forth), and it will make that call to the underlying routine in that file system. So while you see all these VOP macros scattered about, if you look at the implementation for any given file system, you'll see routines called, in HFS for example, hfs_close, hfs_open, hfs_remove. Those are the routines that will actually
be called when you see VOP calls inside of the code.

So let me give you a walkthrough of some of the VFS operations; actually, I think we have the full list of these. One quick note: to instantiate a file system in BSD, and this is the model we've inherited in Mac OS X, you first create a directory and then you mount the device, which corresponds basically to a partition on a hard disk drive, or to some protocol that leads to a hard disk over on a server. You actually mount the device on top of the existing directory. If the directory was not empty, all the files and directories, anything that was in that directory, disappears from view until you unmount it again, so usually we do this on empty directories. So you'll create a directory, you'll mount the device on top of the directory, and from that point on that directory will be a window into all of the files on the partition that you've mounted in that place.

VFS_MOUNT and VFS_UNMOUNT are the calls you use to manipulate that. VFS_MOUNT is going to create and fill in the root vnode, the vnode for the root directory of what you've just mounted on that directory, and it'll initialize the mount structure. VFS_UNMOUNT is going to clean up the mount structure and remove that device from the directory that you created. From that point on you'll no longer be able to access the files on that volume or partition, but you will be able to access any files that were left in that directory when you mounted on top of it.

VFS_ROOT's job is to give you back the root vnode, given a mount structure. So if you're trying to do pathname processing and you're starting at a mount point, this will get you the root vnode. VFS_STATFS is just a get-statistics call, kind of like a GetVolInfo: give me some information about this overall volume, things like the overall size or the amount of free space available. VFS_VGET we won't get into in detail, but
this is the call we use to support access by identifier. So if you want to implement a file system that allows access by directory ID and name, the way HFS Plus does, then you'll need to support VFS_VGET. VFS_INIT is just your basic file system initialization routine. You may or may not need to support VFS_INIT; it depends on whether you have some global, file-system-wide data that you need initialized, not at mount, I mean when your file system is first initialized. VFS_FHTOVP stands for file handle to vnode pointer, and VFS_VPTOFH stands for vnode pointer to file handle. These calls are specifically for supporting NFS access to your volume; they are the two key guys for getting file handles sent across NFS for remote access. As for sysctl and quotactl, I won't go into detail about those, but they are for supporting system-level controls of your file system and for quotas.

We're not going to give you an exhaustive list of all of the vnode operations, but I will tell you about some of the important ones, and these are the ones you'll actually be tested on later, when we talk about our caching architecture. VOP_LOOKUP is the most important and most complicated of all the vnode operations, and its job is to handle pathname resolution. Basically, you give it a pathname that starts from a directory that belongs to your file system, a string that contains a bunch of pathname elements separated by slashes, and this routine will give you back vnodes. It's generally called iteratively as we do the lookup process in the system. Your lookup call will get a component pointer and a vnode that corresponds to a directory, and you will return the vnode for the next thing in the path. You'll get called again if that's one of yours, and it will go on and on until lookup finally reaches the end of the path and gives you back the vnode for the file that you wished to target.

It's probably worth mentioning that Mac OS X has a singly
rooted hierarchical file system layout, much like BSD. The Carbon interfaces will abstract that for you: they'll know where all the various mount points are, and you can ask the Carbon interfaces for a mount list and work like you did in Mac OS 9. But in Mac OS X there will actually be a complete hierarchical file system, so BSD calls will work, and the Cocoa environment, which relies on that, will also work. So lookup becomes a very central routine in Mac OS X.

Open and close are what you're expecting: they're the two routines that enable you to start doing access to a file and take away that ability later. Read and write are the basic read and write routines. Truncate is used to both grow and shrink a file. The name is a little unfortunate, but you can truncate a file and extend its valid space way out, moving the EOF, or you can do a truncate and reduce the size of the file, all the way to zero if you like. Fsync is file synchronization. This is the call that takes all of the dirty buffers you may have associated with your file in the system and starts I/O to the disk on those, so that we can get all the data we have cached in memory flushed out to the disk and you can make sure that your file data is consistent. It's a lot like flush file in Mac OS 9. And VOP_REMOVE is basically the delete call.

As we mentioned before, in order to give the full range of abilities of HFS and HFS Plus, things traditionally associated with Mac OS, we've extended the BSD interfaces to add things like getattrlist and setattrlist, which get you to things like the catalog information and searching. Well, all of those calls are also supported by underlying vnode operations. So if you wanted to implement a file system that supported getattrlist, you would implement VOP_GETATTRLIST, the way we did in HFS Plus, and then those abilities will be available to your callers.

So now we're at the real hard stuff. We want to talk to you about this big
change that we've just made in our caching architecture. Anybody who's actually thinking of implementing a file system will need to know this, and we'll also tell you why we did it. The changes in the caching architecture were actually done by the BSD team, with some assistance from the file systems group, but this is something that spans both file system development and the virtual memory system. So I'm going to talk a little bit about the beginnings of this, and later I'm going to bring up Ramesh, one of the BSD engineers, who will tell you the details of how that system works.

This is a rough picture of what our old cache looked like. Up at the top, on your right, that big box that says VFS is the VFS interfaces we've been talking about, and below that is our underlying support for all of the different file systems. I'm going to walk you through how the buffering used to work. When a read call came through, basically the overall BSD read call would come in and go to the VFS layer. We dispatched to the appropriate file system that was going to service it, and then that file system would most likely call the buffer cache underneath. The buffer cache might have that data already in memory, because you've read it before, or because you were doing sequential access and we anticipated that the page was going to be read and put it in memory for you in advance, and it would satisfy the request out of the buffer cache. If not, the buffer cache would call back to one of your routines to actually read the data off the device or get it off the network. So that's the basic deal: come in with a read, go to the file system that supports it, talk to the buffer cache, try to satisfy the page. It generally works pretty well, and the buffer cache can be used to optimize performance.

Well, there's this other way of doing I/O in our system that we call file mapping, and some of you, if you've worked on BSD, may be
familiar with this. The idea is that you can open up a file and then decide that a certain range of addresses (by the way, I appreciate the applause over there) will be backed by that particular file. You'll say, hey, I'd like you to make addresses 1000 through 5000 valid for me, and I want you to use as their backing store this particular file that I'm telling you about. Then if you read anything in that range of addresses, you'll get the data that's actually in the file that you've mapped into that range. So it's a very handy way of doing I/O, but it does cause some interesting problems for file systems, the first one being that the cache for the virtual memory system is different from the cache for the file system, or at least it was in earlier versions of Mac OS X.

To walk you through this: you have a process that comes along, maps a range of data, says it would like it backed by this particular file, and then starts reading the memory. What happens is that the VM cache will start doing I/O via new calls called pagein and pageout. They'll eventually call the file system underneath, they'll get the data, and they'll put it into the VM cache, and the VM cache will then satisfy those requests.

But what can also happen is that one process can open the file via read and write, another process can map the file, and either of those processes can now write to it. The read/write process is going to write to it via the normal file system read/write API, and the data is going to be cached in the buffer cache. The mapped-file process is just going to write into a range of memory, and it's going to believe that that's mapped into the file, but in the meantime it's going to be stored in the VM cache until the VM system decides to flush it. So if two processes open up the same file by these different mechanisms, they can actually wind up with inconsistent data that corresponds to the same address in the file for that page of memory. That's kind of actually
diagrammed here. You'll see a page of memory here in our buffer cache that's supposed to correspond to a page of data in the file, and another page in the VM cache, and you can see how these guys can become inconsistent. That's the big problem: we have these two caches, and the data can become inconsistent between them, because read/write is using the file system's buffer cache and the VM system is using a separate cache. As long as these things were totally distinct, you could get yourself into trouble.

Now, to try to avoid some of those problems in DP3, there were a number of synchronization points in the file system where we would check with the VM system to see if it had data, try to get flushing to happen, and so forth. But there were still some holes in there, and while it was unlikely, you might have stumbled across some cases where your file system data looked sort of inconsistent.

So what we've done... oh, let me first tell you what we did in Mac OS X Server. We recognized that this was a potential problem, so we had a solution for this issue in Mac OS X Server. We implemented something called the mapped file system, and what it did was turn all regular I/O into mapped I/O, so the VM cache became supreme. In a Mac OS X Server system you would do a read or a write, and it would ultimately turn into a map call, and we would map it into a spot in the kernel's address space as opposed to the process's address space.

Well, that worked pretty well, and it did get rid of the potential data inconsistencies. But the buffer cache was a fixed size and rather tiny underneath. It was also a bit of an issue that the VM system was reigning supreme, so some of that read-ahead stuff might not be happening. And there still were two copies of data in the system. While the VM cache was sort of superimposed on the buffer cache, so that you would never get the data inconsistent, you could still have duplicated data, because the VM system would ask the file system, the
file system would get it out of the buffer cache, and it would go back up. They were never different data, but there were still two copies, and that was inefficient.

Only part of a file could actually be mapped in the kernel. We don't have the full range of process address spaces; we had one kernel address space, and that caused us to have to do some funny things, like potentially throw out mappings for different processes as new processes would come and go and wish to do I/O. Because we only had a limited address space, not all currently mapped files could necessarily be mapped at the same time, and you might not be able to map an entire file at once. So we were throwing mappings in and out of the kernel all the time, causing potential performance issues.

It also wasn't very friendly to the file system, because reads and writes would come in and the file system wouldn't get them right away. They would be redirected to the VM system; the VM system had its cache going, and it would only call the file system when it decided to flush its cache or bring in new data. So if you wanted to have a counter in your file system of all the reads and writes that were taking place, you wouldn't have them at the time the user asked; you'd only see them when the VM system was ready to communicate. So that was an issue too.

So what we've done is we've just unified these caches. That one bullet was a lot of work, but it's in DP4, and so we got rid of all of these issues. Now there's just one cache for both the VM system and the file system, and all the pages that are associated with a file are satisfied out of the same cache. To talk about some of the details of that, we've brought up Ramesh, one of the BSD engineers, and he'll walk you through some of the implementation of the unified buffer cache.

Hi, good afternoon. My name is Ramesh. Basically, Clark mentioned
that we had two kinds of caches in the system. One was the VM page cache, which was a big thing, and we had a tiny buffer cache, which was much smaller. Because we used two caches, we had a data inconsistency.

The way we used to construct the buffer cache was when the machine boots, very early on, when the secondary loader transfers control to the kernel. When the kernel comes up, initially it has a pool of pages. This pool consists of all the physical pages in the system except for the pages that are consumed by the kernel text and data itself. What we used to do is carve out a small portion of the pages from there, and these were meant to be used exclusively for the buffer cache. This is where the fixed size comes from: it's a fixed number of pages that we were using for the buffer cache. As I mentioned before, we had one huge honking VM page cache on one side and a small buffer cache on the other.

What we did, basically, was to stop using the buffer cache for file data. Instead, we turned all the file I/O data over to the VM page cache itself. Now that we've turned all the file pages over to the VM page cache, we have to implement a mechanism to track all the pages corresponding to a particular file, or a single page corresponding to a particular file at a particular offset, in the VM page cache. For this purpose we implemented a mechanism called the universal page list. A universal page list is a mechanism to get at a specific page in a memory object, and it also allows us physical access to the page, so that we can use it to do DMA directly without ever mapping it inside the kernel address space. If you want to know more about the page list, the kernel session, 106, on Thursday morning at nine o'clock will discuss it, if you need to know more about the details.
What we did was to use these UPLs, universal page lists, and develop an infrastructure underneath the buffer cache APIs, like bread, bwrite, getblk, and things like that. So the callers of the buffer cache APIs never have to change, because we maintained the same APIs; we changed all the plumbing underneath to use the unified buffer cache. We also wrote a set of routines that we need to sprinkle around the different file systems to work with the unified buffer cache.

The best way for me to explain all this is to take different scenarios in the life cycle of a file and see how the unified buffer cache works there. First, let me take up open. Let's assume that we are opening a file on an HFS volume, so all the vnode operations I'm going to be talking about are basically HFS Plus vnode operations, and let's also assume that we're opening the file for the first time in the system. As part of the open system call implementation in the kernel, the first thing it tries to do, as Clark mentioned before, is convert the pathname to a file representation inside our system. For our example, that is going to be hfs_lookup; lookup is a vnode operation. One of the things the lookup vnode operation for HFS will do is go and see whether the file already exists, and whether it needs to create it if you passed the create flag to open. Then, whether the file already existed or was just created, it creates the vnode for that particular file. All subsequent references to the file, all manipulations of the file from now on, go through this vnode.

In the lookup we added the first of the support routines I was talking about for the unified buffer cache, called ubc_info_init. What ubc_info_init does is basically create a
memory object and create an association between the memory object we just created and the vnode we just created. It's a one-to-one association: for example, if we're looking at offset 2000 inside this vnode, it translates to offset 2000 inside the memory object that we created.

Now let me take another scenario, where we just want to map the file. Let's say we take the first 32K of the file to be mapped into the user address space. This basically comes to the kernel as an mmap system call. What we do in the mmap system call is basically set up a pager and then allocate a 32K address range (32K because that's my example) in the user process for this particular file, and then we return the address that we allocated in the user space back to the mmap caller. Note that we haven't done any page I/O or any file I/O yet; nothing has happened yet.

When the user first accesses this address range (let's say in our example he's reading the first byte of the allocated range), what happens at that point is that, since there's no page backing this address range, it's going to generate a page fault. The VM system goes and tries to resolve the page fault. The page is not there in the system, so it allocates a clean page and then tries getting the data from the file system for this page. In our case, since we're taking the example of HFS Plus, it ultimately goes to the vnode pager and calls into the HFS pagein call, whose responsibility is to copy the data from the disk into the particular page that VM gave us. Then it returns the page back to the VM, and this page stays in the VM page cache. All future modifications that the user makes to this particular page happen without any knowledge in the file system or anywhere else. Let's say for our example we access some more pages; I've made it sparse, since it's not always contiguous, so we
touched a few more pages of the file, including the range from 20 to 32K.

Now I'm going to take a scenario where we just want to do a write at offset 15 (just to make it more interesting, I picked 15), and we write only 10 bytes of data to the file. We use the write system call to do that. The write system call implementation inside the kernel will translate to the particular vnode operation for that file system, which happens to be hfs_write for us. We then use a universal page list routine called vm_fault_list_request, and we ask it to return the page starting at offset 0, of a size equal to the page size. What it does is give us exclusive access to that page, which was residing in the VM page cache. Once the page has been obtained, we copy the data from the user buffer into this page. Once we've copied the data into the system, then, depending on what the file system wants, it can either write the data to the disk immediately or delay the write. Suppose we actually delay the write. Then all we do is basically return the page back to the VM page cache by using a kernel_upl_commit call, which also comes from the universal page list. So basically we copied 10 bytes into the system, and the user who has mapped this file can also see the modifications that were just made.

OK, I just took another example where he dirtied another page as well. Now I'm going to take another scenario, where we talk about the vnode operation called fsync; in our example it's hfs_fsync. Basically, hfs_fsync's responsibility is to push all the modified data from the buffers back to the disk. Since we have modified data in the VM page cache, fsync also has to make a
call, on top of whatever it used to do before, to a new routine called ubc_pushdirty. What ubc_pushdirty does is basically look at all the modified data corresponding to this particular file in the VM page cache and write it back to the disk. That's how all the data gets flushed. I forgot to mention that the fsync vnode operation can come in two ways. One, the user's program can call the fsync system call to make sure that all the modified data is written to the disk. Or it can come from the system, which periodically runs a system call called sync, which goes and looks at all the modified data in the system, for all file systems, and pushes it to disk. So even if you don't explicitly call fsync, the data gets synced back to the disk.

Now I'm going to take the example of truncate. We just want to truncate the file back to 16K; what I mean is that the logical end of the file is set to 16K. We have a support routine called ubc_setsize, and what it does is invalidate all the pages in the VM page cache that are beyond the logical end of the file. As you can see in this diagram, the pages marked 20K, 24K, and 28K are all gone. They're all invalidated, so they no longer exist in the VM page cache.

Now I'm going to go to the last scenario I want to talk about, where the file gets deleted. Basically, everything that we've set up so far we just tear down, and we do that in three steps. First we call the ubc_setsize routine, which I mentioned in the truncate part, which takes away all the pages that were in the VM page cache; basically we set the logical end of the file to 0. Then we call another function, called ubc_release, which dissociates the memory object from the vnode. And then we also need to call a routine called ubc_uncache, which basically tells the VM system, don't ever cache this memory object, because VM does cache all the memory objects that you
have. So this basically gives a broad overview of how we have implemented the unified buffer cache and what kind of routines we wrote for it.

Now I'm going to cover, in the next two slides, what each file system has to do, and what kind of changes each file system had to go through, to work with the unified buffer cache. When implementing the unified buffer cache we introduced five new vnode operations. Actually, two of them were already there and three of them are new. Of these five, four are mandatory, you need to have them to work with the UBC, and the last one is optional.

The first is a vnode operation called blktooff. What it basically does is, given a logical block, convert it and give us back the file offset. As you may know already, all the buffer cache APIs, like getblk and bread, work with logical block numbers: you give them the logical block number and they return the data. Since all the memory objects work with offsets, this tells us what the offset is for that block number, because every file system can have its own logical block size. The next one is exactly the opposite: it's a translation from the offset to the logical block number, which we use in the pageout path to see if any buffers are there in the cache.

The next two vnode operations are pagein and pageout. They actually existed before the unified buffer cache, with slight differences: they are now passed the universal page list. Other than that, nothing has changed there.

The last one is used for file system clustering. It's called cmap, and it's very similar to bmap. What the bmap vnode operation does is, given a logical block number, convert it to a physical disk block number so that we can do an I/O on it. What cmap does is take a file offset instead of a logical block number, and it also
returns the corresponding disk block number, so that we can do an I/O, and it also tells us how many contiguous blocks there are from that disk block onward, so that you can do a big chunk of I/O. If anybody wants to use file system clustering, then they need to implement that too.

As I already mentioned in all my examples, all file systems have to sprinkle in these five support routines that we wrote. Just to recap them: ubc_info_init needs to be inserted wherever you create the vnodes. You need to call ubc_pushdirty basically whenever you want to send the modified data in the VM page cache back to the disk. You need to call ubc_setsize whenever your file size changes. The file size can change, for example, when you are growing the file via the write system call; that's a time when you need to call it. It can also change when you are shrinking the file using truncate, or growing it with truncate. In any case, you need to call ubc_setsize. And then, when you actually delete a file, what most file systems do is a truncate, because they need to get rid of all the disk blocks, deallocate them, things like that. If you don't already do that, then you need to do a ubc_setsize first; after that you need to do ubc_release and ubc_uncache.

We're going to be putting together a document on how to write file systems for the UBC. It'll be there in Darwin sometime in the near future. All the code for the unified buffer cache and for all the file systems on the system is already in Darwin; it's already there on the Darwin server. If you have any specific questions about your own file system, come see me and we can talk about it later. Now I will get Clark back on the stage.

OK, thanks, Ramesh. All right, so that's it. The only key thing left to
tell you about is that the next new file system feature could be yours. This is our Darwin slide. If you're really interested in the stuff we've talked about, with UBC, vnode ops, and so forth, check out Darwin. Take a look; take our code, please. Some of the ideas we've just sort of tossed around: an NTFS implementation, because we haven't gotten to it yet; ProDOS, the stuff we used to ship with Apple II machines; we don't have a logical volume manager, so Vinum might be an interesting thing to port; we don't have Linux file system support yet, so ext2fs might be an interesting thing to port; and, of course, there's always your FS, the one we haven't thought of.

Some stuff we'd like to have you take away with you. Choose your interface carefully: remember that there are some subtle differences that happen because of the volume format, and there are some subtle differences in what each interface does and does not allow. If you're doing a new object-oriented application you want to use Cocoa; if you need access to the resource fork you need to use Carbon; et cetera. Be aware of those behavior differences that Pat talked about between Mac OS X and Mac OS 9, and also among the different volume formats we support. If you do want to implement a new file system or a file system stack, be sure it's necessary, because you are, as Dean Reece mentioned in the I/O Kit session, going to be in the kernel, and if something goes wrong you'll crash the whole system. So be sure you need to; a phone FS is probably not the best way to do modem access, for example. And be aware of the unified buffer cache, because it has changed the way file systems interact with caching in the system. And finally, make sure you use Darwin as a resource. I'll mention now, even before the roadmap slide, that today at 6:30 in Hall J1 there'll be a birds-of-a-feather session for the folks that are interested in Darwin, and if you have questions about Darwin you'll want to come to that: Hall
J1, 6:30 tonight. Other sessions you might be interested in: there is a Mac OS X Kernel session tomorrow at nine o'clock in the Civic Center, and if you're curious about universal page lists, there will be some talk about that there. There's also a Mac OS X application packaging and document typing session that may be important to go to if you want to find out how we're doing things like mapping, launch bindings, icon decisions, and stuff like that; that's going to be tomorrow at two, here in the big hall. Mac OS X BSD Support is going to be in Hall A2, Friday at 3:30; that's the crucial session if you're curious about the BSD infrastructure inside of Mac OS X. And at the feedback forum for Mac OS X overall, I'll be there, as well as all the other Core OS managers; that'll be in Room J1, Friday at 2:00 p.m.

If you have questions, or you want to get in contact with us about, you know, closer relationships, or get on a mailing list, or whatever, give a shout to John Signa, at signa@apple.com; he is the technology manager for Mac OS X in Apple Worldwide Developer Relations. I see we only have a couple of minutes left, so we're not going to be able to do questions here on the stage, but my team will be in the back of the hall over there if you want to come by and ask us questions. I will also be at the birds of a feather, and I'll also be on campus tomorrow from 7 to 9:30 for the big campus event. So thanks a lot for your time; we appreciate your coming. [Applause]
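
Earlier in this session, the block-mapping call is described as returning the disk block number for an I/O together with the number of contiguous blocks that follow it, so that a large transfer can be issued at once. The following is a minimal user-space sketch of that idea only, not Darwin kernel code: the `Extent` record, the `map_offset` function, the 4096-byte block size, and the example layout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BLOCK_SIZE = 4096  # assumed logical block size in bytes (illustrative only)


@dataclass
class Extent:
    """A run of contiguous disk blocks backing part of a file (hypothetical)."""
    file_block: int  # first logical block of the file covered by this run
    disk_block: int  # disk block where that logical block is stored
    length: int      # number of contiguous blocks in the run


def map_offset(extents: List[Extent], offset: int) -> Optional[Tuple[int, int]]:
    """Map a byte offset to (disk_block, contiguous_blocks), or None.

    contiguous_blocks counts the blocks from disk_block through the end of
    the run, which is how much a caller could transfer in a single I/O.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    logical = offset // BLOCK_SIZE
    for ext in extents:
        if ext.file_block <= logical < ext.file_block + ext.length:
            delta = logical - ext.file_block
            return ext.disk_block + delta, ext.length - delta
    return None  # offset is not backed by any extent


# Example layout: logical blocks 0-3 live at disk blocks 100-103,
# and logical blocks 4-9 live at disk blocks 500-505.
layout = [Extent(0, 100, 4), Extent(4, 500, 6)]
```

A caller doing clustered reads would ask for the mapping at its current offset, issue one I/O covering up to the returned run length, then advance and ask again. A real file system would answer from its on-disk allocation structures rather than an in-memory list.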
