---
title: WWDC2004 Session 613
framework: wwdc
role: article
path: wwdc/wwdc2004-613
---

# WWDC2004 Session 613

## Transcript

Kind: captions Language: en good morning and welcome to session 613 creating application Buster's for apache and mysql is my pleasure to introduce to you thomas Loran from emic networks a partner of ours that we've been working with for the last six months in providing this cluster technology so without further ado Thomas come on good morning everybody that's to make sure that this thing works right here like Chris said my name is Thomas loriana in the principal systems engineer for emic networks where San jose-based company specializing in apache and mysql clustering we do have engineering offices overseas in oulu and helsinki finland so we're kind of an international company but we are located here in the Bay Area what I'm going to talk to you about is application clustering the concepts the technologies for like i said apache and mysql i'm going to talk about why we use load balancing and why we use clustering i'm going to talk about the various types of clustering i'm also going to talk about the various types of load balancing and how load balancing works now when you're talking to an audience like this there's a wide range of experiences so some of the early slides might be a little bit slow for people that are experienced in load balancing or some of the other cluster technologies but I think it's important to get a baseline technology so people understand what I'm talking about when I say load balancing people understand what I'm talking about for clustering because clustering can mean different things to different people so why do we use clustering and load balancing well basically there's two reasons the first one is redundancy people have got else la's that they need to meet you got to have up time for your internal and external customers redundancy is one of the primary reasons that people use cholesterin and load balance in the second one is scalability a lot of times people have hit performance walls on the single law box and they need to put more users on a particular application well how do you do that how do you make sure that your incoming client requests go to multiple servers and then how that's the load balancing part and then how do you make sure that the various servers have the same data consistently but on the back side and databases / to have a particular challenge for that so what are we talking about here what we're talking about is a multi-tiered approach now this is something that most everybody understands you got you know your web clients you got your application servers in the middle and you got your mysql database servers on the back end now sometimes these these layers can be blurred and you'll see that throughout the slides like a lot of times I'll be combining your middle tier and your backends here together and bear with me as I go through these slides that several times I don't have to say okay please understand that when you're talking about an Apache you're talking about the client being your web browsers are going to your application servers but then sometimes when I'm talking about MySQL you need to understand that the client in that particular case is your web app going to the back end but for space in the ways that the slides are laid out that this is the only time really that I'm going to put the three-tier architecture up there but but that's kind of what we're talking about like I said Apache clustering between your browsers and your application servers mysql cluster in between your application servers and your database back in and sometimes you can combine you know in amex case in particular you for very low nsmb people you have the option of combining your application servers with your database servers so one of the things that's pretty important is understanding what clustering is clustering means various things to different people you've got computational clustering application clustering and data store clustering I'm in a kind of under mention what each one of those are and then focused primarily on application clustering and data store clustering so consultation elestren there's a lot of sessions here today this week about computational clusterings and basically it's designed to seamlessly integrate resources like CPUs into a computational grid sometimes called high-performance clustering typical application be video rendering scientific and technical environments a good example as you can read up there is Virginia Tech that's not the focus of this presentation there are other presentations that talk about computational clustering in hpc this is a traditional application clustering and would traditionally what happens is hopefully you guys can see the little little tiny dot here you've got two applications running on two different servers here and traditionally there's an application heartbeat between those two servers now and go back here to this one reason I hesitated for a second is normally on my other slides I had IP addresses what would happen is your top server up here would have one IP address like a 192 168 1 2 the one down here we'd have a 192 168 1 dot 3 and in the center we would have like a virtual IP address that people would connect to those two and it'd be like a 192 168 1 dot 1 the application heartbeat determines which one of these two applications is going generally but not always only one of those applications is going at a particular time what makes application clustering different is the fact that traditionally and once again there's always exceptions I'm talking about generally here generally there's a single data store in this particular case I've got a next serve raid on the back back in that means and there are exceptions to this but generally speaking only one of those applications can be active at a time because you can't have to act two applications writing to your same disk or mounting the same volume at the same time so that's what happens with application clustering now datastore clustering is kind of a different concept and data store what we're talking about is the physical data the disks themselves sometimes it happens at the file level and this really has no awareness of the application so there are various volume replicators or some filesystem replicators things like that that will take your data that's physically on the disk and replicate it back and forth no awareness whatsoever to the application the channel this requires data replication the challenge with this is the applications hold data in memory now what that means why that's a problem is frequently when you're using or clustering your data underneath an application can change physically on the disk but your duplicate physically pulling from what's cashed in memory so you can have a case where things are out of out of sync so when you're running data store computations or data store clustering it's very very important that you have hooks that are built into your application that's where a company like my see like Emmett comes in because we have hooks where we can do the application clustering the data store clustering and make sure that the application is aware that the data physically on a disk has changed and once again I'll talk more about this it's particularly important with databases where you have the concept of order so datastore replication traditional file or disk replication one of the ways that you do this is a master-slave architecture what happens is you've got read rights that come down to a primary server then what that server does is it forms out the requests or replicates those over to other servers that are slaves those other servers sometimes but not always can be used as read-only replicas but you can't write to them and the reason being is data consistency that's why you've only got a single place where you do your rights and then you've got read-only on top of that one of the things one of the when you're talking about databases the concept that follows that is lazy database replication lazy database replication does use the same master slave architecture and it uses log shipping via peer to peer connection to maintain slave databases one of the big advantage of that of course it's easy to set up it has well it it's able to survive relatively high latency for replication and the slaves are read-only now the cons of course are the failover is not transparent for applications it's got limited scalability data can be lost that the master becomes unavailable for extended periods of times those are some of the cons of that data sort clustering what we're talking about here and this is actually master less peer-to-peer is not correct master less is the correct term for this so what happens is each one of you know is able to handle both a read/write and each one of your nodes connects your data and transports it to all the other ones now when you're doing this it's very very important that you have some kind of a consistency protocol so that each one of your notes has exactly the same data because if you don't what will happen is one of your nodes will read one set of data that hasn't been transported over to the other side and you get inconsistent views when you're dealing with a database that could be very bad so active database replication once again this is not a peer-to-peer this is a mass or less and what happens is all database requests are sent to all nodes in a cluster it ensures that the database requests are executed in the correct order and i'll talk more about the ordering problem the advantage here is the applications is the failover is absolutely seamless you get a single IP address where the applications go you the nodes are consistent at all times the cons are it's a little bit more complex to set up and you need additional networks at least an amex case what you need is an additional network on the back side to handle replication but I'll show that in another slide now what I'm going to talk about is load balancing now load balancing the problem that we're going to try to solve is how do you spread the load between the various servers well so you've got in this particular case what I've got here is some web clients that are trying to go to a generic cluster over here so the first request comes on in it goes to one server and you want to make sure that the next requests are replicated and goes over to the other servers well how do you do that well make sure that don't you okay the challenge with doing load balancing like this is how do you know that the server is up you don't want to send data blindly down here well here's a couple common ways that it's done one of the ways is dns round-robin and if you put this in the i'm a network guy so if you put this in terms of the OSI model a dns round-robin is what i call no less.i level basically what you do is you just blindly ship the information down the pike whether that particular servers or not layer three type checking that servers up is you do something like a very simple ping and layer 4 is you can check and scan it from particular ports is open so let's say if you're going after Apache you want to make sure that port 80 is open so not only is a server physically up but you want to make sure that that port is open and finally one of the other ways to do it is you can do it at an application layer where layer 7 where you will do something like an HTTP GET request as if it comes back with the right answer you know that the application is up because it's quite possible for your node to be up in the application is crashed so these are things that load balancers have got to take into account when they determine which one of those servers that they move over to ok so there's traditionally three types of load balancing DNS round-robin which I sort of talked about before hardware load balancing software load balancing so in a DNS round-robin scenario which is very very common what happens is and this could be very elementary for some people here but just bear with me so we have a level set here what happens is a web client does a request for let's say WWII sitecom it goes the request goes to a dns server the dns server responds back with an address in this case 17 100 105 then it takes that physical IP address at 17 100 105 and ships it over to one of the servers the advantage is for dns round-robin is it's very free it's easy let me play back up here and talk a little bit more about this like so what would happen is the next request would do the very same thing it would go up to ww my website com it would get a different IP address instead of 17 100 105 that we get 17 100 100 for which would take it over not to the middle server but would take it over let's say to the top server or one of the other one so what happens is in your DNS zone records you would have for ww my website com you would have four or five six seven different physical IP addresses the advantage is it's free very easy to set up it's very easy to administer however like I said before the con is there's no awareness of the servers being up or down there's no inherent ability to separate out services and that's one of the advanced features as well you can use different names for different services it has no way to say that an HTTP request goes here in HTTPS request goes there a database request goes here so can't handle any of that kind of stuff it's not very intelligent whatsoever so hardware load balancing requires a network device not necessarily a network switch they're very specialized devices to do this and what happens is you get an address that comes in in this case 17 say to 16 17 168 64 and then the load balancer apply some sort of a policy and I'll talk about what the policies are on this their various current parameters that the little hardware load balancer used to determine which one of these goes to it translates that address to a private address of one of these it could go to 1000 123 etc etc or hey said the first one comes in it gets translated goes to 1000 110 zeros are two for your next request 1000 three we back up here and talk about what some of the types of policies are and when I say policies typical ways that they do it are things like round robin now it's different than dns round robin because the hardware load balancer physically knows what server is up and if a server is down it simply drops it out of the rotation one of the other policies is weighted round robin here's a good example let's say you've got a new dual g5 running an application over here and you've got a low-end linux box running out pentium 166 over here what you can do is say is you can wait the g5 to be more preferable than the low end pentium box so let's say for every one request that the Pentium box gets your g5 gets five that's that's one of the weighted ones you can do things like simple love it's a first-hand first out latency but what you can't do and where Emma comes is you can't query the server for things like server load balance or server CPU utilization and parameters that are specific to an application that's running on that server that's one of the policies that emek has it's an advantage over a hardware load balancer like I said they're very very powerful the flexible you can separate out the services and they can handle special cases well William what do I mean by special cases and I'm not going to go into a lot of detail about what special cases are but they're things like ssl encryption things like persistence and what I mean by persistence is sometimes when you're running a web application of financial application you want to make sure that the exact same tuple or exact same client on the same port goes to exactly that same server before that that gets committed that's called persistence one of the other things that Hardware load balancers can do is they can do things called global load balancing where you may have a cluster in New York and you may have a cluster in Los Angeles how do you make sure that somebody in Chicago goes to the right one those are global load balancing things these are things that Hardware load balancers kind of specialized on they're usually deployed in pairs and they're very very expensive they're 15 to 50 thousand dollars each so thirty thousand to a hundred thousand dollars to get them set up they're pretty complex to set up and administer just a lot of policies rules and things like that and like I said they don't have as part of the policies they don't have any direct awareness of what your application load is so software load balancing an example is the emek application cluster what we do is we make sure that each one of our nodes gets a each one of our sessions and then using our policies inside there we will make sure that one of those gets accepted and the other to get dropped and in a second I will go into that and I'll explain how we drop those and how we actually run our flow this is on the front or the public network and we're talking about the load balancing later on the presentation I'm going to talk specifically about emich and how we do after we've load down south how do we do total ordered replication of the database on the backhand like I said the next one would go to the next server gets pretty redundant you can you guys kind of get the idea here so it's the advantages for a make application clusters or software load downs are in our particular case we're designed for HTTP and mysql policy based based on server application load it's very easy to set up and easier to administer I said on the con it does require a separate network on the back side you can be a virtual networking and be a physical one and also it's designed to support only HTTP in mysql and it doesn't handle the special cases currently in this version of code right now that we've got out it doesn't handle the special cases so let me switch gears that was kind of a high overview of what load balancing is in general and what replication is in general very broad brush on that let's talk about emic application clustering wow it's same technology framework and this is where I condense it I talked before about a three tier architecture well this this will always be your clients here and this is a cluster and I deliberately left it out whether that's going to be like an Apache cluster whether that's going to be a mysql cluster so several times in this presentation I'm going to talk about these web clients here remember that I'll try to make a specific reference whether that's a web browser or whether that's an application as your client that's talking to the database on the back end so what my a skill does is we apply apache load balancing on the front end then we can do my SQL load balancing and then we can do mysql replication and you notice one of the things i didn't put up here was actually apache replication that's currently not in this particular product because when you're talking about these kind of websites generally speaking you're talking about static content for a lot of your web data your your dynamic content is generally in your database that gets replicated so there are other mechanisms that you could since it doesn't change that often that you could actually replicate your web content between your various servers but remember to have a cluster you've got have load balancing and you got to make sure that your content is consistent it's databases to change constantly and databases have got the total order problem and that's what we're going to talk about so emek application cluster you get dynamic load balancing you get an active replication as opposed to lazy replication and you get fault detection and isolation so what do I let let's take a walk through here what we do is we configure our switches and we're changing the way that we do this right now between various versions but we take our interfaces and switches are configured to allow all packets from all clients to arrive on all nodes so basically everything gets flooded to each one of these servers generally speaking there will be a syn packet at the front and the syn packet has got to match our policy so the first server set up so that it will accept syn packet one the second one set up to accept two second one sets up three what happens is the other syn packets are dropped so that we only have one session going the initial sin set up goes to all the servers and then the other sessions are dropped that's how it gets flooded and then based on that policy we decide whether we're going to accept that particular flow or not accept that flow now let me give you an example here kind of scalability that we're talking about the first slide right here is this is an example of a standalone mysql server down here these are transactions per second these are fairly typical what we're seeing over some low-end three node clusters here but the relative depending on things like your network your hardware your application on the back side here this is what's a server unit so this is like half a server one server to server etc etc you see when we put my SQ or we put em application cluster on a note on a single node and we turn it on we really don't think much of a performance hit whatsoever and obviously we don't get any scaling because there's only one server when we go to two servers here you can see that we we don't quite double the load but we definitely get up there and then when we go to three servers you see that the the transit actions per second go up it doesn't go up exactly you know 141 but there is a linear relationship as we go up so it's managed off of a java application is aqua fide and this is an example of our client and later on in the demo I'm going to talk about that i'm going to show you this I'm going to connect up to a cluster I'm going to bring a couple nodes online each one of the nodes is represented by a particular glyph or an icon that has various colors for various meaning it can be active wear it handles all our requests you can be offline where it does not handle any requests whatsoever in other words totally dead I can be standby standby kind of interesting because what a standby server is doing is it's replicating all your database changes on the back end but it's not servicing any read requests so it can come online very very quickly because it doesn't have to synchronize any kind of a database so it's ready to go transitional it's sort of moving between the state so you'll see this glyph come up several times when you're actually moving between a standby and inactive or an offline and inactive status and a maintenance mode is where you take the database down for maintenance so I'm going to switch over to a cluster here then you can change the demo over here so what I've got here is I found I've already hooked up to a application cluster this is a prototype software we're running here for an OS 10 port I should say that we're in version 20 we are built for Linux we've got production customers running in version on Linux but several months ago we started an OS 10 port so what I'm showing now is the OS 10 port of our product which is prototype software I'll talk about when we're when we're looking at releasing it but it's not a build because that's already out there it's it's just a conversion over to a different OS what we're seeing here is node 3 up here at the very top is running an inactive status ah no two and three are in a cold status and we've set up a linux server that's running a script that's banging this for load so it God if I remember the script right it's setup it's putting about 350 clients are connecting on it we've got a lot of artificial things in the script to generate a hundred percent load here now what I'm going to do is I'm going to ctrl click on our first server here and I'm going to move it from an offline status to an active status first thing it does is it goes over into a transitional mode and as it starts to move what's happening now is the first server is sending its load over to the second one it's actually copying the database physically so you don't have to have that one set up it's physically copying that database over and now the second one is online now if you take a look at the chart here it took a little bit of a dip and then our CPU utilization should go right back up but our transactions per second let me move over here so we're not checking a load we're checking on our requests radar transactions per second our transactions per second went up here now you'll notice that this doesn't have the same scalability that I shown the other slide because we're running prototype software and our engineers are still kind of tweaking the code I just wanted to show that it does increase it's not as a larger increases we have on our Linux but um well we're still running a hundred percent load on two servers now because that's what our script is pushing we're actually getting more throughput these two servers ones at 96 ones at ninety-eight percent that's about as high as they'll get they usually don't go to a hundred percent but sometimes it takes a little while to level out the load because of the number of transactions some have got a timeout and the load balancer shifts other transactions there so sometimes it'll take you three to five minutes before they sort of level out at the same utilization now what I'm going to do is I'm going to ctrl click on our third note I'm going to bring that online now what's interesting here and while we always recommend deploying these in three is what's going to happen is one of the servers is going to go into a standby mode a special kind of standby mode where it's not going to be servicing out it's going to freeze the database as a snapshot copy it over to the new node that's coming online and cash all the requests that are coming in then when both of them have the exact same copy of the database what it's going to do is take the cached information that it was transmitting verify that everything is consistent and bring all three of them up on the same time so it's better always to have three than it is to so that you can always have one running as you're doing this because one does move into a maintenance mode or in the standby mode so as you can see what's happening is both of these have gone into transition state one of them is still active now this guy's gone offline and he's while the top guy is still servicing all the requests the guy on the right has now transferred his database over to the over to the one on the left he's got caught up and then he takes the information that's cached and he brings those together and well down here this is this was running the first line was a single box the second one with two boxes and now that's increased up to three so while we're still running our loads we go back here engine yeah r load overall for the cluster is still running 97% we've got the Linux box set up to just blast the bosses as much as it can take you'll see that our requests per second up don't pay that much attention to what the Delta is between the steps because that's code optimization that's specific to the OS 10 platform that we're still working on so that basically is the demo that I wanted to show you and go back to the presentation kind of interesting how one of them will go offline in then copy the database over or what while the first one is still servicing all your requests okay can we cut back over to the slides done with the demo all right thanks so let's talk about how EAC does the octave replication and I told you that one of the problems that we had was a two-phase commit I don't think that was actually in the slide lazy when people traditionally did active replication they had a concept the two-phase commit means your database right has got to be committed on one server and that's got to be committed on the other servers I acknowledged before it actually becomes let's say a final commit well that's very very slow so what we wanted to do the whole absence of why emic was created the problem that we were trying to solve was how do we how do we avoid this two phase commit and still have a lot of speed well we came up for our concept of apples active replication on explain how we do that so what we do is we've got a group communication protocol and we use a total reliable multicast protocol it's a token token based passing protocol a lot of people joke with me it's a token ring no no it's not token ring okay it's a token passing scheme that we use and we use the reliable multicast of course if your network guy you say well wait a second rule I multicast is an oxymoron it's not reliable well because we use the token scheme and we what we do is each one of our requests gets a sequence number that's assigned to that particular request and then we're able to keep everything in track the token gives a particular note permission to transmit and it doesn't wait for the for the event the request actually execute on each one that knows what we're trying to do is make sure that it gets queued in the proper order so that's what we're doing with a sequence number for each request we also have a mechanism for handling cluster for partition clusters and what I mean by a partition clusters let's say I've got five clusters in a node there's some kind of a partition on the back end and what happens is they're still there still connected up on the front end so it's being load balanced across the front but it's partitioned on the back so you've got two nodes over here in this group that think that they're servicing all the requests and got three nodes over here determining that they're still servicing all the requests well one of those two can't be right because then you've got a partition cluster we have a mechanism for handling that and shutting down what we call the node that isn't the quorum the quorum or the majority continues to operate and the other part actually drops off the other thing is we're actually degraded with MySQL lock manager so the problem that we're trying to solve is how do we have total ordered consistency so here's our front network we're doing a standard load balancing that we had talked about before we receive the tokens on a private network we assign an order request of them and then we multicast them out on the back side and you'll notice here that we've got a token right here on the top this just gives him permission to transmit and then each one of your request gets assigned a particular sequence number so the sequence number that each one of these nodes gets is one for example then we do our next operation the load balancer goes over let's say to the middle server he does a request that request is number two for example he gets the token so he has permission to transmit that and he multicast that out and then everybody says well the last one I saw was one I'm getting a two so I know that everything's in order it's got to be sequential once again the load balancers changes over to the third node it's got a request that comes in database replication request it's assigns it number three it gets the token it multicast that out to each one of the nodes so as you can see there's no two phase commit here and that's really what we're trying to do is solve that two phase commit problem now for your programmers out there's this makes a lot of sense here's the problem that we're trying to solve with databases it makes a big difference the order that you receive your particular request because as you're well aware you know three plus 4/5 there's a lot of different than 3/4 plus 5 so with databases you got to make sure that the order is exactly consistent now there are a couple things that go into making up that order you've got network latency and things like that but on each server each one of your threads could be operating at a different rate it could be different cpu load on each one of the servers so you need to make sure even though the request gets to that particular server how do you ensure that it gets ordered or executed in exactly the right order on different servers is a very challenging problem like I said what we do is we assign sequence numbers to it and we cue them for execution in exactly the same order as long as they're cute then we can free up and go to the next operation so I talked about fault detection and isolation one of the challenges that we've got we we on our back end we've got a secondary network that can be either real or virtual what happens is we use a token lost time out and a probe which is ping like ping is the best way to describe it we have a particular fail on we have a failure detected on the network here we detect that the token has timed out by the way the target that we're looking at to do this is 10 milliseconds so we're not talking you know 10 15 20 seconds or a minute or two 10 milliseconds is what we're looking at we get it to a token loss and then what we do is we'll send a probe down that probe fails because it can't ping that particular note it times out and detects a failure it does a switchover we bring up our secondary network which can be like i said before can be a tunnel network through your your public interface or it can be another interface another physical interface and this is absolutely transparent to your applications that are running so it continues to replicate your data and really doesn't miss much of a heartbeat whatsoever so this is the example that I was talking about before for fault detection and isolation I moved it over so I've got five servers here i have a fault that happens on a net back network and just for the sake of discussion will say that this occurred on both the primary and the secondary network I mean this is highly unlikely but I'm just saying that this is what would happen we get a fault there on both your primary and secondary networks you have a partition network and you'll notice here that the front side is active so the load balancing is occurring on all three of these each one of each one of these would be servicing a database request but the replication is only hat is working between these two down here and these three up here classic partition network well what happens is we have a partition or two in a partition of three our logic or our policies and decide that the three has got the quorum everybody negotiates a quorum the to understand that they're a minority because they know how many nodes were in the cluster and they simply drop off and you're running a three node cluster and this is how you avoid your your partition network once again the whole idea is to have absolutely consistent replicated databases under you know a very tight consistency it's not really designed for a lazy or loose coupling where it doesn't matter if your read can be you know seconds behind or minutes behind but for very let's say financial applications or things like that where each one of your nodes has has to have exactly the same data at the same time that's really what am i cap l'occasion cluster is designed to do one of the other things that is really kind of an advantage compared to let's say some legacy solutions that are out there so we have a substantially lower total cost of operation this is using an example where somebody else's product will use Intel hardware and our product is running on Apple X serves the assumption is that we're using four hundred users and this and this includes the hardware the software the cows the maintenance everything in a total package and you can see we're substantially cheaper the other solution to do the same thing would be costing you over half a million dollars and our list price on doing this is a little bit over 32,000 dollars so we're going for a very low total cost of operation so kind of in summary there when I'm we do mysql and apache clusters not computational cost clusters where the hybrid approach where we do data store and application clustering we bring those two together our focus is accurate replication where you need very very tight replication between your various servers where you need to scale your databases very solidly very low total cost of operation the other thing that I'm kind of looking to you guys for is our release candidate program like I said we're on version 20 on other platforms but we're looking for beta testers and release candidate testers for our code to run on OS 10 we're looking at releasing that sometime late 23 early q4 with production is scheduled sometime around the end of the year or early part of the year product managers get very upset when I actually you know nail them down with dates you guys are familiar with that I mean there could be a bug that you know somebody determines during the actual release candidate program to slip those dates but anyway we're looking at q4 and early q 1 2005 for production code on OS 10 so if you're interested in that somewhere up here it had my email address on there or you can go to our website at WWE Network amor my email address is Thomas dot Loran at application cluster calm one of the things about having an early stage company is sometimes we really can't decide on what our name is so my email address is actually Thomas dot Lauren lor am at application cluster calm so if you guys want to be part of the release candidate program and test this as we're running out production release I'd be happy to talk to you and getting this set up in your environment so that i am going to work with chris who's going to handle any kind of app Apple questions I wanted to see if there any questions from the audience here on mysql and apache clustering