Transcript
Good morning, everyone
My name is Ben Trumbull and I'm a manager for Core Data
and I'm here to begin the Core Data Best Practices session.
And today, we're going to talk about a number of topics
We're going to talk about concurrency, nested contexts
and then I'm going to bring Melissa Turner on stage to talk about schema design and search optimization
So, as part of these topics, we're going to talk about using Core Data with multiple threads,
sharing unsaved changes between contexts, debugging performance with Instruments,
tuning your model, and improving your predicate usage.
First up, concurrency:
So, when using Core Data,
or really any modeling objects, there are some challenges you're going to face
the first is obviously thread-safety
then there are some issues with transactionality when you have a bunch of changes together,
and of course you need to balance that with performance
So, in the past, a lot of people have done something like this;
they had a bunch of different contexts together and used performSelector: to route, say, a merge notification
or another message onto the main thread or a specific thread
to get to a context.
This makes us a little sad, though.
So, in Lion and iOS 5, we introduced some new methods
and some new concurrency types for NSManagedObjectContext.
And instead of having to trampoline through performSelector:,
you can use performBlock: and performBlockAndWait:
and it's going to look a little bit something like this:
when you create a managed object context you'll specify what kind of concurrency you want it to use;
it will manage that itself and then use performBlock: to route it tasks.
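As a rough sketch of that pattern--assuming coordinator is an NSPersistentStoreCoordinator you've already set up--it might look like this:

```objc
// Sketch only: "coordinator" is assumed to be an existing
// NSPersistentStoreCoordinator for your store.
NSManagedObjectContext *moc = [[NSManagedObjectContext alloc]
    initWithConcurrencyType:NSPrivateQueueConcurrencyType];
[moc setPersistentStoreCoordinator:coordinator];

// Route work to the context; the block runs on its private queue.
[moc performBlock:^{
    NSError *error = nil;
    if (![moc save:&error]) {
        // Handle the error here, inside the block.
    }
}];
```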
So, there are three concurrency options you can use with Core Data:
the first one is, you tell a managed object context that you want it to be bound to the main thread.
and this is great for interacting with view controllers and other aspects of the system that are bound to the main thread
or don't really know much about Core Data and its concurrency.
And then for a lot of your background tasks, and your own work, you can use private queue concurrency.
And finally, there's the confinement concurrency type,
which is what people have been using in the past
before we introduced these new options
So, for the confinement concurrency type, you're basically required to have a separate context
for every thread
and a managed object context can only be used on the thread or queue that created it.
This is the default, legacy option.
So, with the confinement type, everything is going to be serialized against your thread,
and you can use either a serialized dispatch queue or an NSOperationQueue with a maximum concurrency manually set to one
in addition to a specific thread.
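One way you might serialize confinement-type work without dedicating a whole thread--just a sketch, with coordinator assumed to exist--is an NSOperationQueue capped at one operation:

```objc
// An NSOperationQueue with max concurrency of one acts as a serial queue.
NSOperationQueue *serialQueue = [[NSOperationQueue alloc] init];
[serialQueue setMaxConcurrentOperationCount:1];

[serialQueue addOperationWithBlock:^{
    // Create and use the confinement-type context only on this queue.
    NSManagedObjectContext *moc = [[NSManagedObjectContext alloc]
        initWithConcurrencyType:NSConfinementConcurrencyType];
    [moc setPersistentStoreCoordinator:coordinator]; // assumed to exist
    // ... fetch, modify, and save here ...
}];
```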
So here I just want to point out that Core Data isn't using any thread-local state
and we're really interested in having a single control flow;
we're not really as focused on whether or not dispatch queues work with multiple threads,
or how that's happening underneath the covers.
So, thread confinement is pretty straightforward; it's safe; it's efficient.
The transactions are obviously scoped to the managed object context, so nothing else gets to interfere with it.
But it does put a burden on you to manage all these issues.
So, in particular, tracking which context goes with which thread,
potentially keeping extra threads around for background tasks,
and then all of the special behaviors that Core Data uses to integrate with view controllers,
Cocoa bindings, undo management.
We're going to have to infer from context whether you created the managed object context on the main thread
and those things--those we call "user events", typically--are driven by the run loop of the application.
So, in contrast to confinement, private queue concurrency maintains its own private serialized queue
and you can only use it on this queue, and you do that
by setting up blocks as tasks, and enqueueing them using performBlock: and performBlockAndWait:
Now, within those blocks you can use the managed object context API normally.
And I just want to really emphasize that in this case, the queue is private and you shouldn't yank it out
and interact with it directly.
If you want to, you can dispatch work to your own queues, you can dispatch_sync at the end of those blocks.
And there are a number of advantages to this,
It lets the managed object context maintain which queue it's using and handle whether or not it's in the right state,
on the right thread,
and other threads can easily interact with that managed object context by calling performBlock:
unlike with the confinement concurrency type; those other threads really can't message that managed object context at all.
And these can be created from any thread, and the queues are going to be much more efficient than
keeping extra threads lying around in the background to do their tasks
like background fetching.
And the third type is the main queue concurrency type.
This is going to behave very similarly to the private queue concurrency type,
only the queue is obviously always the main thread,
and non-main threads can just call performBlock: on that as well.
And it will integrate all of those behaviors that I talked about: undo management and other life cycle events
with the main run loop.
So, what that means is that when you create a managed object context with the main queue concurrency type,
your view controllers and other things can just message it directly;
they don't have to know about all of these different performBlock: APIs,
and it's very easy for other tasks that you have in the background to just enqueue performBlock: on it
and have those then update view state.
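A sketch of that division of labor--request, backgroundQueue, and the update logic here are all assumed, not from the session:

```objc
// A main-queue context that view controllers can message directly.
NSManagedObjectContext *mainMOC = [[NSManagedObjectContext alloc]
    initWithConcurrencyType:NSMainQueueConcurrencyType];

// On the main thread, no performBlock: is needed:
NSError *error = nil;
NSArray *results = [mainMOC executeFetchRequest:request error:&error];

// From a background queue, enqueue updates instead:
dispatch_async(backgroundQueue, ^{
    [mainMOC performBlock:^{
        // Update view state here, on the main queue.
    }];
});
```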
So, here's sort of a diagram of what I mean is going on:
a background thread can enqueue a block directly,
but the view controllers can just start using managed object context API.
So in this way, Cocoa bindings, for instance, doesn't know about concurrency types or performBlock:,
but it can just work with the managed object context the way it always has,
and you can have background threads and other queues enqueue messages
to happen on the main thread context in that way.
So, I mentioned that we had these notions of user events,
and for the main thread, that's going to be tightly integrated with the application's run loop,
but for contexts running off the main thread,
either in a private queue or on your own thread,
Core Data is going to defer a bunch of tasks and then coalesce work later on;
this is the change notification--coalescing changes for notifications, delete propagation,
setting up the undo groupings; stuff like that.
And, for the most part, on background threads, we consider this to be the time in between calls to processPendingChanges:.
So, a couple of useful points for all the concurrency types
is that managed objects are always owned by their managed object contexts
and that Object IDs are a great way to pass references around between contexts
because they're going to be safe, immutable objects.
And something else that's a nice point is that retain and release are going to be thread-safe
on all Core Data objects, everywhere, all the time, without exception.
They should be thread-safe on all Cocoa objects, but your mileage may vary on that one.
That means that you can actually retain a managed object independently of its managed object context;
you just can't necessarily use it directly.
So, some good times for you to pass around updates to other contexts or to update the views
are going to be with these NSNotifications that Core Data provides,
with the ObjectsDidChange notification and the ContextDidSave notification.
and you can refresh other managed object contexts pretty easily, after they save,
with mergeChangesFromContextDidSaveNotification:.
And here I'd just like to call out that you're responsible for the thread safety of the managed object contexts receiving this message
but you don't have to worry about the notification data that's being generated here;
Core Data will manage the thread-safety of that information.
So, you just have to maintain the rules that we've outlined in the past on the receiver of the merge method.
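Here's a sketch of that merge pattern, assuming a backgroundMOC doing the saving and a mainMOC driving the UI:

```objc
// Observe saves from the background context...
[[NSNotificationCenter defaultCenter]
    addObserverForName:NSManagedObjectContextDidSaveNotification
                object:backgroundMOC
                 queue:nil
            usingBlock:^(NSNotification *note) {
    // We're responsible for the receiver's thread safety, so hop
    // onto the main context's queue; the notification payload itself
    // is managed by Core Data and safe to carry across threads.
    [mainMOC performBlock:^{
        [mainMOC mergeChangesFromContextDidSaveNotification:note];
    }];
}];
```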
And, when you're inside some of these notifications as the observer,
you can find out some useful methods of taking a look at the state of what's changed in the merged objects,
something that we added last release was changedValuesForCurrentEvent,
which will give you the values that changed since the previous call to processPendingChanges:,
and then some older methods,
changedValues and committedValuesForKeys will go back to the last time the object was saved.
So now I'm going to go into a little more depth
about these performBlock: and performBlockAndWait: methods that I mentioned earlier,
and our challenge here is to find a way to pass work to other threads,
these managed object contexts running on their own queue or the main queue,
and to sort of demarcate the actual group of changes you want to be coalesced together,
whether it's for an undo grouping, or for validation,
or potentially to save,
as well as a convenient way to integrate with all of the other APIs on the platform.
and that's part of the reason we chose blocks.
So, performBlock: is an asynchronous request to enqueue this.
We consider this its own isolated user event, and it also includes an autorelease pool.
I really want to call out that, in all of these methods, it is illegal to throw an exception out of the block,
so if you do have exceptions, please catch them and resolve them inside the block.
And there's no support for reentrancy in this performBlock: method
And by that what I mean is, when you call performBlock: on a managed object context,
and within that performBlock: call, you call performBlock: again,
you're basically just getting the same effect as if you had iteratively called performBlock:.
So, this is an asynchronous call, and all it's doing is enqueuing a task to happen later.
So, in contrast, we have performBlockAndWait:.
This is synchronous; it's very lightweight; we don't consider it to be any kind of event.
It doesn't even include an autorelease pool.
But what it does do is, it will support some reentrancy,
so if you call performBlockAndWait: from within another performBlock:
you'll basically get them nested;
they'll be executed immediately inline
as opposed to enqueued later
So, this is very convenient as long as you're just working with one managed object context
for these blocks.
So, these APIs are very fast and lightweight.
The performBlockAndWait: API is on the same order of magnitude as valueForKey:, for instance
and the changes there, from Core Data's perspective
are going to be scoped by the block.
So, however large or small you make the block,
it's going to be sort of one self-encapsulated change set.
So, when you're working with data between blocks, like I said,
you can retain objects independently of their threads
and pass them between blocks, but object IDs are often going to be useful,
you can rematerialize them into managed objects when they get inside the block
using objectWithID:,
and this will reuse whatever cached state you have around,
perhaps at the persistent store coordinator level
We keep a cache there as well as at the managed object context.
So if the data is already in memory we're not going to go back to disk to get it
and this lets you use object IDs as immutable objects to be passed around
and not worry too much about the thread-safety of your references
and then when you get into the block you can rematerialize those into managed objects
But on occasion you'll find it useful to pass managed objects around;
you just can't actually inspect or use them directly; when you do so, you can just retain them.
And of course __block variables are a great way to pass out results.
And a lot of our APIs return NSErrors, so it's very important to remember that these are autoreleased,
and as I mentioned performBlock: includes an autorelease pool,
so you'll probably want to either handle or retain the errors before returning from your blocks.
So, a simple example of how you might use some of these APIs.
Here we have a context, and it's synchronously calling performBlockAndWait:
to execute a fetch request that's been captured by this block from some code further up
and, if we don't have an error, then we just ask the array of managed objects to give us back
its object IDs, and we return those out of the block with a __block variable.
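In code, that example might look something like this (moc and fetchRequest are assumed to exist; the explicit retain reflects the manual-retain-release era this talk comes from):

```objc
__block NSArray *objectIDs = nil;
[moc performBlockAndWait:^{
    NSError *error = nil;
    NSArray *results = [moc executeFetchRequest:fetchRequest error:&error];
    if (results != nil) {
        // Collect the safe, immutable object IDs to return to the caller.
        objectIDs = [[results valueForKey:@"objectID"] retain];
    } else {
        // Handle or retain the autoreleased error before leaving the block.
    }
}];
```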
So, as I mentioned, the queue is often going to be very private to the managed object context,
and we don't want you changing anything about it,
so if you need to, and you're using your own queues as I'd expect,
you can just simply at the end of the work block that you passed the managed object context,
enqueue another block back onto your own queue as the callback
to let it know that it's done and process any results.
But there are a number of other ways that you can either coordinate with your own queues,
or other queues in the system, and dispatch semaphores are one way of doing that.
You can create a semaphore, and then at the end of the block, signal the semaphore
and then in this particular code snippet, the context is asynchronously performing this block
and the code that is calling performBlock: here is actually waiting until that is done
on the semaphore.
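That snippet, roughly reconstructed (moc is assumed):

```objc
dispatch_semaphore_t done = dispatch_semaphore_create(0);

[moc performBlock:^{
    // ... work with the context on its own queue ...
    dispatch_semaphore_signal(done); // announce completion
}];

// The calling code blocks here until the block above has run.
dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
```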
And then, something else that I'd like to give a little shout-out
are dispatch groups,
and if you haven't used them they have some very interesting behaviors.
And you can use them to organize some pretty complex dependencies
between a variety of queues and blocks between them.
So when you use dispatch_group_enter, it's a little like incrementing a retain count
on when the queue will be done.
And then, the worker blocks can call dispatch_group_leave to decrement it
and when it ends up getting back down to zero,
conceptually, dispatch_group_wait will return, or
dispatch_group_notify will enqueue a block that you passed it onto your own queue.
So what this lets you do is basically, you don't actually have to know how many waiters
you want to float around; you can just call dispatch_group_enter as you add more work
or as you decide to build in these dependencies,
and then have them call dispatch_group_leave.
So, this is a very simple example.
It's very similar to the semaphore example.
This becomes more interesting when you have more queues involved.
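A minimal sketch of the group pattern (moc and myQueue are assumed):

```objc
dispatch_group_t group = dispatch_group_create();

dispatch_group_enter(group);       // like retaining the outstanding work
[moc performBlock:^{
    // ... work on the context's queue ...
    dispatch_group_leave(group);   // like releasing it when finished
}];

// Either block until the count reaches zero...
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// ...or get a callback on your own queue instead:
// dispatch_group_notify(group, myQueue, ^{ /* all done */ });
```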
So, now I'd like to move on from concurrency to talk about nested managed object contexts.
And in particular the reasons you'd be interested in nested managed object contexts
are going to be passing objects around between managed object contexts
and implementing something like an asynchronous save.
So in the past, working with managed object contexts,
you can push and pull changes that have been saved between contexts
and use the merge notification to do that.
But passing unsaved changes between contexts, or having them really work with unsaved changes
can be very difficult.
And similarly it's very difficult to break up the save operation to be asynchronous.
So here, for a nested context, the parent contexts are going to act
kind of like the persistent store, from the perspective of the child contexts.
And the child context is going to see the state of its objects as they currently exist in the parent.
Children will then inherit unsaved changes from the parent whenever they fault things in
or they execute a save request
and they'll marshall their saves in memory.
So instead of saving back to disk, the children will just turn around and save to their parent context.
So it looks a little something like this
and the child doesn't know that it's not actually talking to the persistent store
it's just talking to a parent context
and the behaviors are going to be very analogous in the way that saving works, and faulting.
So, in this way, peers that all inherit from the same parent context
can all push and pull changes between them,
and you can implement an asynchronous save by setting up the parent context
to have a private queue and having the child contexts, typically on the main thread,
save into the parent context, and then tell the parent context to save.
And one of the ways you might leverage that is something like a detail inspector.
So the detail inspector will inherit the view state as it is in your main context.
So for sharing unsaved changes, when you save the child context,
they'll just push up one level, and then you can pull those changes back down using a fetch
or the merge notification between child contexts, or by calling refreshObject:.
It's the same way you would with non-nested managed object contexts.
For an asynchronous save,
when you save the child, the parent context gets those changes and holds onto them until it's told to save
and the changes won't be written to disk until the root-most parent saves.
So that would look something like this, where a parent context has a private queue concurrency type
so it will execute requests asynchronously
and the child contexts get set up and create a reference to this parent context
so when the child saves, it pushes its changes up to the parent
and then here, it enqueues an asynchronous block to tell the parent that you want the parent to save.
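Put together, the asynchronous-save setup might be sketched like this (coordinator is assumed to exist):

```objc
// Parent owns the store and a private queue.
NSManagedObjectContext *parent = [[NSManagedObjectContext alloc]
    initWithConcurrencyType:NSPrivateQueueConcurrencyType];
[parent setPersistentStoreCoordinator:coordinator]; // assumed

// Child runs on the main queue and saves into the parent.
NSManagedObjectContext *child = [[NSManagedObjectContext alloc]
    initWithConcurrencyType:NSMainQueueConcurrencyType];
[child setParentContext:parent];

// Saving the child pushes changes up one level, in memory...
NSError *error = nil;
if ([child save:&error]) {
    // ...and the parent writes them to disk asynchronously.
    [parent performBlock:^{
        NSError *parentError = nil;
        [parent save:&parentError];
    }];
}
```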
For inheriting changes in the detail inspector, you just create a child context for the detail inspector.
and if you decide to commit the changes within the inspector,
they'll get pushed into the parent, which is probably going to be something like the main context
for your view state
and anything you do in the child context for the inspector,
it's just going to incorporate the current unsaved state in the parent
and you don't even necessarily have to do anything special
if you decide to cancel out of the inspector; you can just throw away the child context.
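As a sketch, the inspector pattern (mainMOC is assumed to be your main-thread context):

```objc
// The child sees the parent's current unsaved state.
NSManagedObjectContext *inspectorMOC = [[NSManagedObjectContext alloc]
    initWithConcurrencyType:NSMainQueueConcurrencyType];
[inspectorMOC setParentContext:mainMOC];

// Commit: push the inspector's edits one level up, into mainMOC.
NSError *error = nil;
[inspectorMOC save:&error];

// Cancel: just throw the child away; the parent never sees the edits.
inspectorMOC = nil;
```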
So, some important things to remember.
is that saving with the child context is only going to push the changes up a single level,
but fetching is going to go to the database and pull data through all the levels.
Keep in mind, though that in general, Core Data isn't going to change any objects that you already have
out from underneath you
so if you fetch an object that you already have, you will see its previous state;
so if, say, it's been dirtied, we're not going to blow away your changes.
We're simply going to keep, in the fetched results, the reference to that object.
And you can call refreshObject: if you want to get new values for it.
objectWithID: on a child context will pull from the fewest number of levels necessary to get that data
so it might go to the database, or it might only go up a single level to the parent.
And, all parent contexts must adopt one of the queue types for concurrency.
So they can either be main queue concurrency type or private queue concurrency type
but we don't support them with the legacy confinement concurrency type.
Child contexts depend pretty heavily on their parents,
so the parent contexts really should not do blocking operations down on their children.
By this I mean: the children are going to call performBlockAndWait: and do a lot of operations for you.
For instance, executeFetchRequest: on a child context internally is going to turn around
and ask its parent context to do part of the fetch
and then pull down those changes into itself.
So what this means is, there's sort of naturally a dependency there,
and if the parent contexts turn around and call performBlockAndWait: on their children,
you'll basically end up deadlocking, because you'll have all these queues trying to synchronously wait on each other.
So in general, you should imagine that requests are going to flow up
this hierarchy of managed object contexts finally to the database at the root,
and results are going to flow back down.
And now I'm going to bring Melissa Turner on stage
to talk to you about performance.
Thank you.
[Applause]
Thanks, Ben.
So, performance.
How do you know when you've got a performance problem?
How do you figure out what you need to do when you've got a performance problem?
Lots of questions.
The first stage, when you're starting to sit down in front of your application,
"Is this thing ready to release to my customers?
Is it performant enough?
Are they going to be annoyed with me?
Are they going to file bad reports on me in the App Store
or are they going to give me five stars?"
is to start asking yourself questions about the application.
What environment does it run in, and have I designed it to be compatible with that environment?
What should it be doing, and are the "shoulds" and "dos" compatible?
What kind of things do you need to know about the environment?
Well, actually, very little nowadays.
As long as you're using the Apple-supplied frameworks,
things like libDispatch, then we will take care of making sure that you're doing things properly from, say,
the confinement standpoint,
but you will need to do things like design for your network environment.
If you have an application that uses the NSIncrementalStore APIs to build a store
that talks to a Web service,
you probably want to make sure that whenever your user triggers an action
that will require going out and talking to that Web service,
it doesn't block the main UI of the application.
You'll need to think about stuff like that.
That is a performance issue.
You'll need to think about what is sufficient performance
versus what is optimal performance.
Sufficient is, your application gets up and gets the job done.
Optimal is, it really "wows" your user and allows you
to spend more time doing interesting things in your application because you're not wasting cycles
doing things inefficiently.
One crucial point to remember is that if you're building an application that supports multiple platforms,
test on the minimal configuration.
This cannot be emphasized enough,
because if it works on your minimal configuration,
it's going to blow people away on all other platforms.
What should your application be doing?
You should know this; you've written it.
You know things like, well, it opens documents.
If you open a document, there's very little way to get around it,
you need to do file system access and load at least some of the data
so you can show it to the user; that's what they're expecting.
If the user instigates a network access, it's the same thing.
Know when the user is accessing the network and how they're accessing the network
so you don't do things like accidentally go out and fetch the same piece of data three or four times.
And you need to know what kind of random processing your user is likely to kick off:
calculate me some transform on an image; scale it; apply a filter.
Do something interesting like that.
These are things you know your application can do,
and you should expect to see them in your performance analysis.
And then there's what the application does do, stuff it does automatically.
You have a data set that you need to go out and check periodically
to see if there's new data on your Web service.
That kind of thing happens automatically; you should build it into your calculations.
Try not to do it when the user has kicked off that image transform.
Does it post notifications?
You should try to do that in some unobtrusive way
using our APIs that will make it all happen nice and smoothly.
And if for some reason you want to calculate the 2438th digit of pi,
try to do it at 3 o'clock in the morning on a Friday, when they're not likely to be using the application.
How do you figure out what your application does
once you know what you think it should be doing?
Measure it.
Measure, measure, measure, measure.
This is where everything starts.
Figure out where your application is actually spending time
so you don't end up spending two weeks optimizing what turns out to be
one percent of your application's workload.
It's much better to spend two weeks optimizing fifty percent of your application's workload.
Start with the Time Profiler in Instruments.
This will tell you exactly where your application is spending all of its time,
method by method.
There's also the Core Data template in Instruments.
This will tell you when Core Data is touching the file system.
We have a template that contains instruments for fetching, for saving, for firing relationship faults,
and for when we have to go to the database because the data we're looking for is not in the cache.
And there's also the com.apple.CoreData.SQLDebug default.
If you pass this to your application when you launch it,
or set it with defaults write,
it will cause Core Data to print out all of the SQL that is being sent to the database,
and you can have a look at that, see what you're sending to the database,
look at the SQL that's being generated,
figure out if this is really the SQL that should be generated in that case,
if you're doing too much work, doing too little work, or doing too many trips to the database,
this kind of thing; this default will tell you that.
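For example--com.example.MyApp here is a stand-in bundle identifier, not from the session--you might enable it like this:

```shell
# Persist the setting for one app (higher values print more detail):
defaults write com.example.MyApp com.apple.CoreData.SQLDebug 1

# Or pass it as a launch argument, e.g. in Xcode's scheme editor:
# -com.apple.CoreData.SQLDebug 1
```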
Many of you have probably heard this before,
because it's a very common phrase in the real world, if you're building anything with your hands.
Measure twice. Cut once.
You cannot uncut a piece of lumber.
And it's less important in the virtual world, because we have SCM systems.
It's always possible to revert to yesterday's build.
But the thing is, you can't get back the time you have invested
going down that false path.
So make sure you're fixing the right thing before you go off and fix it.
For the rest of this presentation I'm going to do a series of demos,
or I will be having my lovely assistant do a series of demos
that are based around a table view.
And this is primarily because table views are easy to visualize;
if I say there's too much data being loaded,
you can get a grasp of what that says.
If I say there's too little data, or the wrong data--it's badly formed--
you can get an idea what that means.
But the lessons are generally applicable to anything that's going to be loading and processing data
from a store.
Just as a disclaimer: the demos are specifically chosen so that they have performance issues
that are visible on stage.
Any performance problems that you have in your app will probably
be a bit more subtle, but they'll have the same basic patterns.
In the beginning, there was a table view.
You know, your customers are not going to pay you for this
because that's not terribly interesting.
You need something, and in my case, I went on vacation.
Those of you who are familiar with this picture will probably realize I was in Rome.
and that this is a picture of the Colosseum; it's an architecture picture.
These are all pieces of information that I want to build into an application that displays my holiday photos.
My first pass is going to be to take all of those pieces of information that I've got
and combine those into an object that I can use to back my table view.
Call it a Photo object; it's got a label: "This was taken in Rome."
It's got a blob that is the photo bytes,
some tags: architecture and Colosseum,
and a timestamp, when the photo was taken.
At this point, I'm going to bring Shane up on stage
and he's going to see how well that worked in a first pass.
Hello, my name's Shane ???
and I am a QA engineer with the Core Data team
So here we have the first demo that Melissa mentioned;
this is version one of the photos application.
As you can see, this simply maps over a simple Photo entity; it's a single-entity application.
And when you click on the record, we can see the photo.
So this works as promised.
Now what we're going to do is hook this up to Instruments
and get some measurements.
Now for those of you who haven't used Instruments before, I'd like to show you what you see when you first launch it.
What you'll notice here is that you get a sheet with all of your instrument templates.
In our case we're going to use the iOS Simulator.
Off to the left you have some groups which allow you to target a specific platform.
OS X, iOS, or the Simulator.
You want to keep in mind when you're using the Simulator what Melissa mentioned earlier
about your environment.
This is actually a simulated application,
so while it looks like iOS, it's running on our development hardware
so we don't have the same constraints that we would have if we were using a device,
such as memory, processor, and disk space.
If you select the Core Data template you will get the instruments that Melissa mentioned earlier: