peter wilson: ok. good evening, and thank you very much to everyone for coming this evening. i'm peter wilson. i'm an engineering director with google here in kirkland. and it's my pleasure tonight to introduce shiva, who's going to be talking this evening. before we do, i have some announcements i've been asked to make.

firstly, the restrooms are in the hall outside. secondly, we're going to be videotaping this, and putting it up on google video, which is one of several great products written here in kirkland. so this will be on google video, just the presenters, not the audience. so if you don't want to [inaudible] you don't need to worry about that. we encourage you to hang around after the

presentation's over and continue the discussion with myself and shiva and a bunch of the other engineers from google who will be here. and then lastly, we have three future events coming up. on june 7th, jen fitzpatrick is talking about user experience at google. on june 21st, ellen spertus is talking about a study of orkut. and on june 28th, we're talking

about security at google. so i'd encourage you both to register for these, and also please forward the invites that you got for this event to any friends or co-workers you think might be interested in attending any of those events. so that's it for the announcements. the talk tonight, building large systems at google-- it goes without saying really that google deals with huge amounts of data, and a huge number of users.

and so shiva's going to talk tonight about basically how this all works, and how we do what we do, and how we make it scale. shiva is an engineering director and a google distinguished entrepreneur here at kirkland. he founded this office here, and has been by far the [unintelligible] that i've been here. he's been responsible for many of google's advertising products, and the google search appliance.

prior to working at google, he co-founded gigabeat, a start-up which was purchased by napster. he has a bachelor's from ucla, and a phd in computer science from stanford. so please join me in welcoming shiva. narayanan shivakumar: thanks peter. so before i get started, i want a quick show of hands. how many of you have seen a talk by a google engineer in the last nine months?

ok, that's great. that's not too many. because we do reuse some slides. so i just want to make sure we don't spend too much time on certain slides. so i'm going to talk about building large systems at google. it's really in three different components. we have a bunch of google products that you've seen, in

search, in advertising, in building email, building talk, and a whole bunch of other things. and typically, all these products, they tend to be on top of a very large stack of [unintelligible] systems, and some [unintelligible] the client. at the bottom of the stack, we have the computing platform, which is a whole bunch of machines in a bunch of different data centers, which have been designed and built up over

time to work with all these different [unintelligible]. i'll be talking more about this. and by the way, if you have any questions, just stop and interrupt me. because i know what my slides say, and i'll get really bored if i just talk to my slides. so just talk, interrupt me. i'll be happy to talk more. the mission that google has had for the last seven years

is to organize the world's information, and make it universally accessible as well as to make it useful. one question that people typically ask is, how much information is there? the answer is, a lot. so just look at the web, and how it's been growing over the last few years. i remember when i started at google about five years back, we had close to maybe a billion documents.

and over the last few years, it's grown to be a lot larger than that. and if you just take away the web, there's even a lot more than that, in terms of your personal email, your files and databases. and we've been expanding into other areas as well, in terms of video, in terms of radio, print, et cetera, as well. so in terms of products, you've seen a bunch of these things.

initially, we started a few years back with the web. and our charter was to say, let's not do anything but the web. so that was the initial charter. and over time, we kept expanding into more and more media formats, in terms of images, video, and into enterprises, into scholarly publications, maps, and recently, into areas like chat, and voice [unintelligible]

as well. so this is a list of all the google products that have been built up over time. if you just look at the list of products we've launched from this office right here in the last year and a half, it's quite a bit. so there's a mix of products in a bunch of different areas. there's a lot of work going on in search, in crawling, in indexing.

so one of the products we've launched in the last year is google sitemaps, which is a mechanism through which webmasters tell us what they have on their site, as well as get information from us as to what we believe we know about their site. then there is a bunch of client software. for example, there's google gadgets, which launched a couple of weeks back. and this is part of desktop search.

google pack is another one that launched about three months back. there's a lot of work going on here in terms of google maps. for example, a lot of the rendering for the geographical information of maps, as well as for specific countries, [inaudible]. google video, the [unintelligible] and most of the functionality was built here, and we are expanding into more of those areas.

talk and google chat and a variety of other things have been primarily built from this office in just the past year and a half. so every one of these products that we have is hard. because there are more and more users using these different pieces of applications. and every one of them has more data, and every one requires more results, and better results, with time. so for example, if you pick any of our products, you want

to make sure that by analyzing what users are doing, we want to be able to give them better results over time. so every single one of these products keeps expanding in terms of the amount of data we process, as well as the number of users who actually want to use this data. so the overall goal that we have for systems infrastructure, that is, the pieces of software, the parts of the stack, that are necessary to support all of these different products, is that every few months, every day, we want to

keep building larger scale as well as higher performance infrastructure to support these applications. because everything is growing. and also, we want to make sure that we don't go bankrupt. so we focus on making sure that the price per performance is actually getting better with time. and the ease of use as well. we do not want to have every single application build up entire parts of the stack.

so we want to make sure the infrastructure we build is scaling as well as supporting more users. and it's easy for the folks within the company to deploy more and more products with time. so any questions so far before i get into the depths of systems infrastructure? ok. so i'm going to talk about three different thoughts over here.

so just a reminder of this. so google products sit on top of [unintelligible] systems, and i'm going to talk about the hardware computing platform later. but in terms of the infrastructure that we have, i'm going to talk about google file system, which is a mechanism-- it's really a large distributed file system at which we throw lots and lots of data. and then we have a couple of computation models as well as

storage models that we'll be talking about as well. so gfs is yet another distributed file system that we built up in the last few years. so the question is, why did we want to build our own file system, as opposed to reusing some existing file systems? as it turns out, we have a lot of flexibility in what we do. we control our applications, we control our operating system, as well as the machines, which we own. so based on this observation, we looked at a whole class of

applications and realized that some of these things have large amounts of read/write [unintelligible]. as well as, we need to have reliability over many, many machines. and we want to make sure that it runs across multiple data centers. and also, the actual files are actually pretty large in this environment. so we need to have a bunch of applications that can store

huge amounts of data across lots of machines, as well as in multiple data centers. so that's the reason why we started building the google file system. so the way in which it kind of works is, there's a gfs master. and the master is what every single application talks to. and the master controls a whole bunch of these chunk servers.

so chunk servers are, essentially, you can think of them as machines which have certain fractions of content that is localized to that particular machine. and what the master does is it takes any file that you want to store. let's say you want to store a terabyte of information. it takes that file, and breaks it up into chunks of data. and says, you, machine number 55, you will be responsible for storing this particular set of chunks.

and the other component of the master is to make sure that it's replicated across multiple machines. so if a machine goes down, and we know that machines will go down, then at least the application can survive by talking to another bunch of machines, for example. so the master essentially manages the chunk-to-location mapping, and makes sure that the online systems are scaling in terms of load balancing, as well as redirecting traffic to the machines that are least loaded.
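to make that concrete, here's a minimal sketch of the kind of bookkeeping a gfs-style master does: split a file into fixed-size chunks and remember which chunk servers hold each one. the 64 megabyte chunk size and three-way replication are the commonly cited defaults, and all the class and method names here are made up for illustration, not gfs's actual api.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the commonly cited gfs default
REPLICAS = 3                   # assumed replication factor

class ToyMaster:
    """holds only metadata: which chunk of which file lives on which servers."""

    def __init__(self, chunkservers):
        self.chunk_locations = {}        # (filename, chunk_index) -> [servers]
        self._rr = itertools.cycle(chunkservers)

    def create_file(self, filename, size_bytes):
        num_chunks = (size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE
        for idx in range(num_chunks):
            # pick REPLICAS distinct servers round-robin; real placement also
            # considers load, free disk, and rack diversity
            replicas = []
            while len(replicas) < REPLICAS:
                server = next(self._rr)
                if server not in replicas:
                    replicas.append(server)
            self.chunk_locations[(filename, idx)] = replicas
        return num_chunks

    def lookup(self, filename, offset):
        """a client asks the master where a byte offset lives, then talks to
        the chunk servers directly for the data itself."""
        idx = offset // CHUNK_SIZE
        return idx, self.chunk_locations[(filename, idx)]

master = ToyMaster([f"chunkserver-{i}" for i in range(1000)])
master.create_file("/logs/ads-feed", size_bytes=10**12)   # roughly a terabyte
print(master.lookup("/logs/ads-feed", offset=5 * 10**11))
```

the point of the sketch is that the master only ever handles this small mapping; the chunk servers handle the actual bytes, which is what lets one master front thousands of machines.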

and also, when a machine goes down, the other chunks will be sort of [unintelligible] by transferring data automatically, by being a mirror of themselves. so in terms of the gfs usage at google, we have a lot of these gfs servers, and we have all of these clusters. and these scale to very large amounts of data.

so typically, the clusters have a thousand machines, or a few thousand machines, and the number of clients that talk to the master is pretty large as well, and growing with time. and you can be pretty impressed with the amount of data that goes through these systems. and every one of these things is part of, say, the ads pipeline, or the search pipeline, or any other product we talked about, in terms of being the core storage on

which we store all our data. audience: question. narayanan shivakumar: yeah. audience: is there one master per cluster? narayanan shivakumar: so the question is, is there one master per cluster. so, it depends. so in a bunch of cases, you want to make sure that there are a bunch of masters.

so that if a master actually goes down, you could reliably hand off to another master. but you want to make sure that there aren't multiple masters that are really masters at the same time. because you don't want to have them controlling which chunks are going to which machine and making different kinds of decisions. so you can imagine there being a leader election that decides which, out of all these [unintelligible] machines,

which one do i want to become the master. and there will be a master handoff process. some of these things can be automated, some of these things are more manual as well. so for example, if a machine goes down, it could be that you may have to restart a bunch of master machines so that they do take over for the actual task. but you also want to make sure that when a machine goes down, there's enough of the system state stored so that when another

master comes along and picks it up, it would have complete state and can do something with that data. audience: but are there multiple instances of gfs? narayanan shivakumar: are there multiple instances of gfs? yes. there are lots of gfs cells. and different applications could run different gfs cells. and there is actually a lot of work where there are now

standard gfs locations where people can actually store data and [unintelligible]. so depending on the application, you can either choose to use an existing gfs cluster, or make up your own gfs cluster as well. audience: are the masters self-managing, or if you have an unlimited amount of storage, [unintelligible], something like that? narayanan shivakumar: so the question is, are the masters

self-managing? the answer is, you would like them to be. and in the case that they're not, because there could be bugs or because something hasn't been built up, you want to make sure that you have enough reliability in terms of pulling a master from one set of machines to another set of machines. audience: so is the leader election algorithm more of a [unintelligible] algorithm?

narayanan shivakumar: so the question is, what is the leader election algorithm? and the answer is, i don't know the answer to that question. but there are lots of different options. and the real issue is whether you want to have leader election in the context of one data center, or across data centers. and depending on how much network [unintelligible] you expect to happen, you may choose one versus the other.

and ultimately you've got to look at how often machines go down, and how often the masters actually go down. because i can actually say that the master could be on a much more reliable machine than other machines. these are trade-offs that are made on a pretty regular basis. audience: but in theory, any machine in the cluster could become a master? or are only selected machines eligible?

narayanan shivakumar: so the question was, can any machine become the master? again, it depends on the application. there are certain classes of machines which are certainly more reliable. so it depends on, again, the precise context. you may not be willing to lose some data in some cases. again, it depends on how much money you want to spend on this, how much reliability you want. that's why looking at the price per performance number

on a per application basis is very important. for example, we don't want to lose any logs data. in that case, you may be willing to invest a lot more money to have much better master election algorithms, or much better machines in any of those [unintelligible]. on the other hand, if it's my personal gfs cluster, maybe i don't care as much. but again, the application can make its own trade-offs. but you want to have enough libraries, functionality,

and building blocks so that when an application chooses to do one versus the other, they have enough flexibility. audience: what's the maximum range of chunk servers per master pool? narayanan shivakumar: what is the maximum range of chunk servers per master pool? the answer is, i don't know. so it depends again. so you can actually look at the number of--

so let's say that a typical chunk size is around 64 megabytes. that's a suggested number. but you could arguably have chunk servers that process chunks that are larger than that. so maybe the question to answer is, how much storage-- i think one interesting question to ask is, given a thousand machines, how much can you store?

how much reliability can you get? and how much read and write throughput can you get? and that depends again on how you want to configure it. there will be some default configuration, but these are things that you can elect to change as well. does that make sense? and again, the key is to make sure that we have enough infrastructure so that people can make different choices, and there's enough flexibility.
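as a back-of-envelope illustration of the "given a thousand machines, how much can you store" question, here's a tiny sketch. the disk counts, disk sizes, replication factor, and bandwidth figures are made-up assumptions for commodity hardware of that era, not numbers from the talk.

```python
machines = 1000
disks_per_machine = 2          # assumed
disk_size_gb = 250             # assumed commodity disk of the era
replication = 3                # assumed three-way chunk replication
disk_bandwidth_mb_s = 50       # assumed sequential read rate per disk

raw_tb = machines * disks_per_machine * disk_size_gb / 1000
usable_tb = raw_tb / replication
aggregate_read_gb_s = machines * disks_per_machine * disk_bandwidth_mb_s / 1000

print(f"raw capacity:        {raw_tb:.0f} TB")
print(f"usable (3 copies):   {usable_tb:.0f} TB")
print(f"aggregate scan rate: {aggregate_read_gb_s:.0f} GB/s across all clients")
```

the exact numbers don't matter; the point is that capacity and throughput scale roughly linearly with machine count, while replication divides the usable capacity, which is exactly the kind of trade-off each application gets to configure.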

and you need to have the right building blocks along the way, so people can choose the right numbers. so i'm going to move from gfs to a couple of other systems on top of gfs. so gfs has been around for a while, and it's available across the company. and there's an anecdote that we've lost less than 64 megabytes of data, ever, since gfs was deployed. so now that that's a reliable storage mechanism, we want to

start processing data and doing interesting things with it. so one of the example applications, which runs on top of gfs and processes huge amounts of data and does interesting things with it, is mapreduce. i'll be talking about bigtable as well. bigtable is more of a look-up service rather than a log-structured, let-me-store-all-my-data kind of [unintelligible].

i'll get to that in a couple of slides. so mapreduce is a way by which you can have a lot of very complicated, application-specific kinds of computation happening on large amounts of data. so imagine that we have a whole bunch of webpages. and what we would like to do is count up the number of words that occur across all the different webpages. so what the map phase will do is, it will try to take all

the records. so in this case, the pages will be-- imagine them to be stored on gfs. so what you can do is, you take every single page, and stream them through mapreduce. and the mapping phase would essentially [unintelligible] key-value pairs of the form word to the url of the page. so essentially, it translates, or maps, from one view of the

data to another view of the data, based on what you would like to aggregate. so in this example, the mapping phase would take all the webpages and create a key-value pair, the key being the word, and the value being the url of the page. the reduction, what it does is, it takes a whole bunch of records which are of the same key type, and then aggregates them by doing a bunch of things and producing the final result. so everyone who uses this model of mapreduce will

write essentially two phases. people in the company who want to process a large amount of data, say a terabyte of data on a thousand machines, would take the data and would write the mapping steps and reduction steps, and would run this on a large number of machines. and the underlying library, which is the mapreduce library, would take it and [unintelligible]. for example, how do you move data from

one machine to another? how do you make sure that when a machine actually dies, it will automatically recover and the intermediate results are not lost? basically, that's what the mapreduce library does for you. so here's a pictorial view of that. so again, you see the central component over here is gfs. so gfs [unintelligible]

has lots and lots of data. and all the boxes in there are the chunk servers. so every one of the chunk servers has, say, a bunch of-- a partition of the file that it would like to have mapreduced, or that the application decides to analyze. so what the application would do is, it'll take all this data, and then, applying the mapping step, will produce a whole

bunch of these key and value tables. and then, we have a shuffling step which essentially aggregates all values of the same key type. so over here, we have that all the k1's, which are on different machines, which are all getting their data from gfs, will all be shuffled and sent across to the reduction step, which will likely be on another machine. and then, the reduction step will go and aggregate all the values within that key type.

and finally, after that reduction step is done, which could be for example a sum or a filter or whatever you want to do, then it'll write them back into gfs for additional processing. so that's the high-level view of how mapreduce works. any questions about it so far? so why do people use mapreduce? one is, it actually works across a whole bunch of machines.

it's an interesting computation model. it's a nice way to partition any given task into a whole bunch of these map and reduction steps that can be sent across lots of machines. machines can die, and the system will just recover automatically. and it's scalable. we can just keep adding more and more machines. and also, it's a nice computation model in that it

works across a whole bunch of different applications. it works across search, it works with our ads, it works with our video. any product that you can think of has, at some point in its evolution, a whole bunch of mapreduces, or analysis that is going on, to make sure that we understand what the users are doing, or to precompute a bunch of data that will be useful back to the user. ok, i'm going to switch to bigtable.

any questions about gfs or mapreduce so far? audience: could you give a few more examples of using mapreduce to solve particular problems? presumably, you find that you can turn a whole bunch of things into mapreduce that you wouldn't have thought you could. narayanan shivakumar: right. so the question was, can i give a few more examples of mapreduces?

there are a bunch of applications. one of the simplest ones is, say, a grep. so what you would like to do is, let's say you want to run a unix shell grep, or you want to take a whole bunch of data, and then you want to do a word count on that. so imagine a whole bunch of files, and you want to do a grep, to compute a whole bunch of keys that are interesting to you, and then do a word count on that. so what you would do in that case is you would make each of

the files files in gfs. and then the mapping step would take the files and would map them into, for example, line numbers. and then the counting step would actually be done in the reduction step. another example would be sorting. so take something as simple as sorting, and say that i want to sort a terabyte of data. so what i would like to do is, i want to aggregate

essentially all the records with the same key onto the same machine, and then do some sort of aggregation on top of that. again, that would be a natural thing for mapreduce. and at a more product-specific level, we can think of the crawling pipeline, or indexing pipeline. there are actually a bunch of, maybe 20 different mapreduces that run along in every single part of this pipeline that do various different things.

again, it's a matter of looking at data with a whole bunch of records, and again aggregating the keys and maybe counting them up, figuring it out. and then you write it back into gfs. then you have a second mapreduce that'll come along, but again, work on top of this data, and then go along in every step of the pipeline. i'm not sure i was very helpful there. yeah.
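to make the word-count flavor of job concrete, here's a minimal, single-process sketch of the map / shuffle / reduce flow described above. the real library handles partitioning the input, moving data between machines, and retrying failed workers; this toy only shows the two functions a user would write, and everything here is illustrative python rather than google's actual api.

```python
from collections import defaultdict

def map_word_count(url, page_text):
    """user-written map step: emit (key, value) pairs, here (word, 1)."""
    for word in page_text.split():
        yield word, 1

def reduce_word_count(word, counts):
    """user-written reduce step: combine all values seen for one key."""
    return word, sum(counts)

def toy_mapreduce(records, mapper, reducer):
    # shuffle: group every emitted value by its key -- the step the real
    # library performs across machines
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    return [reducer(k, vs) for k, vs in grouped.items()]

pages = [
    ("http://example.com/a", "the quick brown fox"),
    ("http://example.com/b", "the lazy dog and the fox"),
]
print(sorted(toy_mapreduce(pages, map_word_count, reduce_word_count)))
```

a grep would look much the same, except the map step would emit only the matching lines and the reduce step would simply pass them through.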

audience: so the mapreduce technology that you described, it sounds like an interesting way to distribute computations. the question i have is, have you found any problems where mapreduce is not enough, where you needed some different way of distributing the application? narayanan shivakumar: so the question is, is mapreduce enough? mapreduce solves a large class of problems very nicely.

but there are a bunch of cases where it doesn't do the right things. so what we do is, every few months, we have a new set of problems that have come along. and people go and work on building new computation models on top of gfs. and it doesn't have to be mapreduce. so we haven't talked about a bunch of things that we've done in-house that aren't mapreduces.

but it's a nice, convenient thing to do a bunch of things very easily. especially if you're doing simple things like counting, or filtering, and you have too much data to fit into memory, then you would want to use mapreduce. but there are other things where you don't actually do filtering and counting, [unintelligible]. and in those cases, we would want to do something else.

but this is not intended to be the ultimate parallel programming language. because that is different as well. but it works well across lots of machines, for certain types of tasks. so we talked about gfs, which is our file store; it's the large distributed file system we talked about. you can store whatever you want to store on that. mapreduce is a computation on top of gfs.

so another thing that we have that works on top of gfs is bigtable. so bigtable is-- you can think of it as a look-up mechanism. so we have a lot of structured and some not-so-structured data at google. and what we would like to do is quickly query what the different attributes are, or what different values correspond to a given key, pretty fast. as

opposed to, say, skimming through gfs files and doing a [unintelligible] on top of that. so it's a look-up table. and traditionally, you'd have databases [inaudible]. so the problem that we have is that the scale is very large. we have lots of urls and lots and lots of raw data. for example, let's look at maps, which is something we launched about a year back. and that has a bunch of data, and you want to have a storage

back-end [unintelligible] quickly look up, given a key, what the value of that particular row would be. and the problem is that you can't just throw it at your favorite commercial database. because the scale doesn't quite work. it's way too much. we have tried in the past few years to use commercial solutions.

but they don't work, partly because they are too expensive, or they don't have enough reliability, or they don't work across a thousand machines, or, pick your favorite set of problems. and also, the other key is that by controlling the storage, you can do a bunch of optimizations, so that you can control how fast something would work with a certain kind of computation. so that's why we started building bigtable about a

few months, year back. also it's just fun to buildlarge systems, so we do that. so the key features is that it'sa map, as i just said. you give it a key, you wantto get back a value. as opposed to gfs, which iswhere you store all your data. so imagine how you build adistributed hashmap, or a distributed just map. and you want it to scale[unintelligible] machines.

and you want it to be fast. soyou want it to be as much profit and memory as possible. and also it has to support lotsof reads, and lots and lots of writes. and also support[unintelligible]. so these are the kinds ofproblems that bigtable solves. and also, it's reliable. you can pick out a bunch ofmachines, you can add in a whole bunch of machines.

and the whole system justmagically works, and by doing load balancing, by moving itaround, by just reconfiguring itself, so that you can justkeep adding more and more machines while the systemis serving. and it doesn't lose data,which is very important. so in terms of building-- yes. audience: on the previous page,it said it was a mapping key to value. is it also sorted by key?

narayanan shivakumar: thequestion is, is it also sorted by key? it's actually, surprisingly,an implementation as it turns out right now, thereis a bunch of sorting that's going on inthe key level. so scanning is prettyefficiency, so you want to give it a range. let's say i want to get backall the tables within one range to another, thenit will support that.
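here's a toy sketch of the kind of interface being described: a map from key to value that also keeps keys sorted, so a range scan is just a contiguous slice. the class and method names are invented for illustration, and this single-machine version ignores all the distribution, replication, and splitting that make the real system hard.

```python
import bisect

class ToySortedTable:
    """single-machine stand-in for a sorted key -> value table."""

    def __init__(self):
        self._keys = []     # kept sorted so range scans are a contiguous slice
        self._values = {}

    def write(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def read(self, key):
        return self._values.get(key)

    def scan(self, start_key, end_key):
        """yield (key, value) for start_key <= key < end_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

table = ToySortedTable()
table.write("com.example/a", "page a")
table.write("com.example/b", "page b")
table.write("org.example/c", "page c")
print(list(table.scan("com.example/", "com.example0")))  # only the com.example rows
```

the sorted order is what makes "give me everything between these two keys" cheap; without it, a range query would have to touch every key.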

but they haven't said that that sorted behavior is guaranteed to be true a year from now, in terms of the api. but i suspect it will be, because scans are pretty important to do as well. so, yes. audience: can you talk a little bit about whether-- to what extent do gfs and bigtable wind up doing the same thing? for example, with coping with machine failures. or does bigtable rely on gfs for everything like that that it needs?

narayanan shivakumar: so the question was, how do bigtable and gfs compare in terms of machine failures? so gfs is a storage model. and this actually is a nice segue into my next slide. it talks about building blocks. and one of the things that bigtable does use is gfs. so all the storage happens in gfs, and there's an

abstraction on top of that. and there are other components that are [unintelligible] lock servers, that work across multiple machines, and there's a scheduler that makes sure that machines can run pretty well. but i think gfs is very reliable [unintelligible]. bigtable, what it does really well is, it's a nice way in which you can store data, modify data, and keep growing the [unintelligible]

and do look-ups, which is not something that gfs supports. so these are two different layers. audience: i think you might have answered my question. i was going to ask you to compare the api, from a programmer's view, of gfs versus bigtable. what sorts of operations are present in bigtable that aren't in gfs? narayanan shivakumar: yeah, gfs is very low-level. so you want to store a bunch of data in it.

and what it supports is, it lets you store huge amounts of data, and it supports very good read-write bandwidth, as long as you're scanning data rather than doing lots of look-ups. and bigtable is more about modifying data and wanting to store lots of data, and then adding more and more machines as you need to. and it all just magically works. so again, getting back to the building blocks, every

single component we build actually re-uses some component that we've already built in the past. or in the cases where that doesn't work, then we go and build what the application needs. so for example, we needed to have lock servers that actually worked reliably across lots and lots of machines. so we built that. and mapreduce was just the computation on top of this.

and clearly, because it's such a core piece of infrastructure, we certainly want to make sure that any new system being built will work with mapreduce as well. and also, at the same time, re-use as much of past [unintelligible] like gfs and [unintelligible] as possible, so we don't go and implement every single part of the stack every single time a new application comes along. so the question is, are we done? some of you asked this a little while back.

we are not even close to being done. because every few months, the scale just keeps increasing dramatically. and whatever we built before doesn't necessarily work anymore. because we do think about what is going to happen a year from now, or two years from now, whatever. we build systems to that scale. but unfortunately, what happens is, the number of users keeps growing, which is nice.

and the amount of data just keeps growing. so we have to go and make sure that every single component we have in the stack will have to be redone to scale to the new extent. so every few months, we have a new computation model that comes along. or we have a new bunch of tweaks that need to be made to our storage infrastructure, or to the way we treat new rpcs, or to the way in which we copy data

across from one machine to a whole bunch of other machines. so all this is ongoing. and every so often, every six months, an engineer comes along with a bright idea that says, what if we do blah? and the blah will lead to either, say, a hundred-fold reduction in latency for doing something, or it will lead to a reduction in capital expenditure. because we will use maybe one tenth of the capacity of the machines as we did in the past.

and you see the immediate and tangible benefits. because it's scaling. it all scales, so you need to have a very good background in algorithms to figure out that if you do x or y, or if you introduce a new computation model, or if you change this part of the stack, then all these other things will be impacted along the way as well. so the goal really is what i said earlier. the goal is always to create something that's larger and

actually higher performance, and to make sure it works in more ways. so for example, one of the problems-- so i'm just going to let you know about a bunch of problems that are interesting open areas that we are always looking at. so, replication and low latency access from across the world. that's a hard problem. how do you get a whole bunch of machines and a bunch of data centers across the world to reflect the latest version

of what a user did? it's very hard. so we keep working on those kinds of problems. and then, what happens if a bunch of data centers go down? or what happens if a bunch of machines go down? there's a bunch of work that we have done so far to make sure that we cope with many of these things. but as the [unintelligible] keeps changing, we keep solving more of these kinds of problems. or even something

like resource fairness. let's say you have a lot of machines. how do you make sure that these machines, which are doing perhaps multiple tasks, how do they do resource sharing? how do you make sure that serving, let's say that you're doing serving on search traffic, that's clearly more important than your latest word-count mapreduce that is running on a few machines and

copying data across. so how do you make sure that certain things are prioritized higher than others? that's an open issue as well that we need to work on. any questions before i switch to hardware? audience: so suppose you make changes to your mapreduce or any other algorithms, how do you go back and retroactively update the previous machine [inaudible]? narayanan shivakumar: so the question is, if you go and

change a fundamental library, like mapreduce or whatever, how do you go and change every other application that's running out there? so this gets into how google engineering works a bit more than i had planned to. but the nice part is that we don't have too many legacy applications that we need to maintain over the next few years. so every time a person does a new build, or makes a bunch

of changes, the improvements are reflected in their applications the next time they build it and push it out as well. as opposed to [unintelligible] where the client software is out in the field and we need to support it for the next five years. so since everyone is building and pushing out binaries, they just pick up changes on a regular basis. audience: i'm sorry, i'm not sure i understood that.

narayanan shivakumar: ok. audience: so you're saying you don't-- you implement going forward. you don't maintain. narayanan shivakumar: so let me see. so the problem is that any software that is deployed out in the field is harder to change. that's why you have companies that have problems with legacy software.

in our case, since we control all the applications, every so often, everyone [unintelligible] the next version of a certain product that is being pushed out. for example, let's say the video [unintelligible] is pushing out a new version of their mapreduce. then they build, and they pick up the latest set of changes, and then they push it out. which means that the average age of the binary that's out on a thousand machines is actually very, very low.

so then in the next few weeks, it'll be fixed. audience: so it might be correct to say at any moment, any machine may have multiple versions, different versions of mapreduce, [inaudible]. narayanan shivakumar: there are a lot of other things i haven't been talking about that i can't even talk about at this point, which are things that we've done. but we're trying to make sure that versioning problems and consistency, which is actually a whole bunch of very hard

problems, are actually being addressed under the covers. so what happens if one server talks its own kind of protocol and then you have another server that is the client, how does that actually respond to that? so there are a bunch of challenges in that area as well. any other questions? so we've talked about a slew of products. we've talked about some of the computing infrastructure that

we have in terms of three pieces of technology. we've got a lot of others that we haven't talked about yet. but these are indicative of the kinds of problems that we like to solve. but beneath that, we have a lot of other things that are going on from a computing standpoint. so when you have a lot of machines, how do you build them to be cost efficient? how do you maintain them so that you have

multiple data centers? what if a bunch of these things die? what about how much power you consume? all these are pretty interesting problems to really look at. so from a computing standpoint, there are lots of problems that i find interesting. so there are issues around server design. so let's assume that you have this sort of magical systems

stack that works on a bunch of different machines. so how do you design the actual fundamental-- what kind of [unintelligible] would you choose? what kind of chips do you choose? how do you get the right kind of business to work with the set-up? and power efficiency is another area of importance. so i'll show a graph on this in a slide. networking is hard.

because you have lots of these machines, again, that need to talk to each other. and you want to make sure there aren't too many single points of failure. or let's say you have a bunch of machines in the back; you want to make sure that the switches have enough bandwidth capacity to send enough packets through, so they are not a bottleneck. especially when you're talking about going from one set of

machines to another set of machines. so the classic philosophy we've had is, we want to build a lot of these very low-end machines. so these are pc-class servers: pick your favorite set of commodity motherboards, pick your favorite set of ram and then disks, and assemble all these things. and the assumption is that for any company that sells [unintelligible] reliable hardware, that won't

necessarily work. the probability of failure is perhaps lower, but you still need to compensate for the times that they fail by building a lot of these systems on top to recover when the machines actually die or disks actually die. so the observation was, we may as well just go for the cheap things, and just build up a whole bunch of interesting software on top of it, so that whenever they do die, because we know

that they will die, we'll just sort of recover in a nice seamless way, and our products will just work. so a lot of you may have seen this picture. this was the initial google machine. and you can see a bunch of lego chassis over here, which held the hard drives. things have evolved a little bit since then. this was in '99. and again, this gets back to the ultra-cheap commodity

hardware that we rely on, that the users rely on. and in fact, these were a fire hazard too, because these were built out of cardboard. and no one told us when we were building these things that they would catch fire. but they tried to optimize it by just having the cardboard, putting the motherboard on top, and then the disks just hanging over there. and then at some point when the santa clara fire

department checked it, they moved to a more traditional kind of data center, more or less, which actually had, you can see some chassis over here, which encompass the motherboard and the components. and the thing is-- i'm just amazed every single time i see a data center just come alive. it's completely empty, and then a few days later, it's just all completely stacked up with machines.

and it all comes about, and it all just magically works. it's kind of cool. so right now, the current design is slightly more professional. you can see there's [unintelligible] over here. and again, a lot of these things are still open because of power consumption and cooling issues. and again, it's the same sort of architecture.

we have pc-class motherboards. and we have very low-end [unintelligible] hard drives, and we run linux on these things. and we've built up all these software systems on top of these things, that do not know where the machines are, why they are there, when they die, when they come back to life. and we seamlessly move things around. that's really how this all kind of works. so [unintelligible], what is google doing in the hardware

space as well as in the software or systems space? we've been working with chip manufacturers for a while. because at some point, we realized that the price per performance was actually getting better with time, over the last few generations. and the prices of these things were actually getting better over time as well. as in, you're getting cheaper components, and it was pointing in the right direction.

but the price per wattage, as in the amount of power that is being consumed on a performance basis, was actually kind of steady. and the problem was that if you want to power a regular motherboard with a bunch of components, it may take 250 watts to power it up. but then cooling also takes another 250 watts. and as these systems get more and more complicated, you have these huge cooling issues and power issues.
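a rough back-of-envelope shows why that matters: at 250 watts to run a box and roughly as much again to cool it, the electricity over a machine's lifetime lands in the same ballpark as the hardware itself. the electricity price, lifetime, and server cost below are made-up illustrative numbers, not figures from the talk.

```python
power_watts = 250       # to run the box, per the talk
cooling_watts = 250     # roughly as much again to cool it, per the talk
price_per_kwh = 0.08    # assumed electricity price, dollars
lifetime_years = 3      # assumed machine lifetime
server_cost = 1500      # assumed cost of a cheap commodity server, dollars

hours = lifetime_years * 365 * 24
energy_kwh = (power_watts + cooling_watts) / 1000 * hours
energy_cost = energy_kwh * price_per_kwh

print(f"energy over {lifetime_years} years: {energy_kwh:,.0f} kWh, "
      f"about ${energy_cost:,.0f}, versus ${server_cost} for the machine")
```

under those assumptions the power bill alone is on the order of a thousand dollars per machine, which is why per-watt efficiency starts to matter as much as per-dollar performance.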

and i think we did a graph that showed that, if we just kept going at that rate, then the machines would not be the most significant component of our costs, but [inaudible]. so we've been working with chip manufacturers for a while to make sure that we have more power efficient chips, as well as control of it [inaudible]. and the data centers themselves are pretty interesting. interesting in terms of how much cooling they require, and

how much time it takes to build up a whole bunch of these data centers, and so on. so there's a lot of work in that area as well. audience: do you use any [unintelligible] anymore, or do you do your own data centers? narayanan shivakumar: the question was, do we use [unintelligible] facilities anywhere, or do we have our own data centers? the answer is really a combination of both.

and i'm thinking of what i can say. so at some point, we had-- it's a pity this thing's recording, but that's fine. at some point we had a whole bunch of data centers that were coming online, because people thought there was a lot of money to be made in that. and over time, a lot of them went bankrupt, because they couldn't find ways in which to commoditize

other companies coming in and then hosting these things. and then they started evolving into [unintelligible] service providers, and other business models as well. so just to be safe, i think for all companies, it makes sense to have a mix of [unintelligible] facilities that you're using, as well as things that-- and i can't [unintelligible]. so we're doing a bunch of things in the hardware area, and working with people who can supply us with things

that'll work well in our context. and also, as we talked about, we keep working on these larger and higher performance systems from a software standpoint, so that every few months we just change some part of the stack, and all the applications just benefit from that. and that's nice. so now we have a lot of machines and lots of data and a lot of bandwidth that you actually have between

different machines and different data centers; what do you do with it? and this is where we spend most of our time. we say, wow, we have all this data. what else can we-- can we pre-compute more? can we make query processing faster by doing a bunch of these different things? or can we offer a new feature, or a product, that actually

will be compelling to the end user? so it's really a mix of-- it's a full cycle. so we start with products. and then the products work on top of a bunch of interesting systems stack. and then you have the hardware platform beneath that. and as the products keep driving the improvements to the stack, as well as the computing platform, we get more

flexibility to go and do more and more cool things from the product standpoint. so that's what we keep [inaudible]. so behind every single server we have, we have a whole bunch of issues that keep coming up. in hardware, in networking, systems, algorithms, programming languages, statistics, product design, mechanical engineering. and so it's kind of cool playing around, because we're

just having the right combination of hardware and software where we can solve more and more problems with it. and that's really what we try to do. so this is, again, a slide back just to place us in some context. and just in the last year and a half, just from this office right here, we launched a whole slew of products, in maps, in crawling, indexing, and so on. every single one of these things uses every part of the

stack we have, in terms of the size of the machines, in terms of the storage that we have in gfs, or the [unintelligible] tools that we have from mapreduce, or the look-up kind of servers that bigtable offers. and that's really what we try to do, keep building more and more of these things. any more questions? these are the upcoming talks in the next few weeks. any other questions about building

large systems at google? audience: so i want to understand a bit about resource allocation. i understand that if you've got an app, you can say, hey, you can only use 500 machines. so at that level, you can constrain it. but do different apps use the same infrastructures? say gfs or bigtable, or do they tend to use their own copies?

in other words, can you control the interference between apps on a machine to machine level, when they use some shared service, or don't they? narayanan shivakumar: the question is, these different pieces of infrastructure, are they maintained by the different teams, or are they services? audience: and if they're services-- are they services that multiple apps use, or does

each app have its own copy of the service? narayanan shivakumar: so do the different applications have their own copy, or do they have just one service that they can reuse? the answer is really, it depends on the actual service. some of these are libraries. so every application can just reuse whatever functionality they get out of that. but for things like storage,

it again depends on a per group or per department basis. you might imagine that the ads system, which is pretty darn important to the company, or the search service, which is very important to the company, would not be very open to having my favorite mapreduce running on their bigtable. right? and that's because it's scary that it could just bring down all of that.

but then again, there are a bunch of things that are services. for example, there's one crawler. it doesn't make sense to have multiple crawlers. so you would have a bunch of machines that'll just sit around and will crawl a bunch of the things on the web. because that's an influence that we have over the rest of the web. so certain pieces of infrastructure

make sense as services. and certain ones just make sense to have your own copy of, based on the kind of reliability or the interference you're concerned about. audience: but have you avoided situations in which an unimportant app can bog down an important app by interfering in some shared service? narayanan shivakumar: so the question is, what about interference between one app and another app.

that's another open area that we are excited by. and i can't tell you exactly how much progress we've made in that area, but it's an important area. the more we can reuse machines across different services, the better off we will be. and it's a bunch of hard problems. so we solve them in ways that work for now, but may not necessarily work in five years, which is why we keep doing this. audience: so with more and more [inaudible]

how do you see this change from centralized servers [inaudible] what do you do to evolve [inaudible]? narayanan shivakumar: so the question was, so far we have had a lot of backend services that we've controlled, and what happens if we start shipping software to a bunch of clients, and how do things evolve in that [unintelligible]? a lot of the applications we've shipped so far have a

very strong-- they're pretty light on the end user's computer. and they interact a lot with the backend technologies that we have. so they may issue a request back to us for some component, or may require some interaction that wouldn't necessarily be on the user's computer. and so all the clients we've launched are light and thin on the user-facing side, and have a lot of backend interaction, which is [unintelligible].

audience: so you didn't really say much about how you manage software fault tolerance and redundant arrays of inexpensive computers and all that. there's a lot of existing software out there to do that sort of thing, in clusters, and so forth. are you using some of the existing software? or doing something completely proprietary? or did you start with something that was out there and morphed it into something proprietary, or what?

narayanan shivakumar: so the question was, how much do we reuse software that has been built up over time for reliability and scale of delivery? pretty much all the things that we do have been built up over time internally. so at some point we did have a few solutions that we tried out, and they just didn't scale enough. so in that situation, if it's open source, we could do a bunch of things with it.

if it's not, then you may as well sort of build the components yourself. and you should realize that the benefit of having an existing stack is very important. so you don't have to redo the entire stack. so if you have an existing stack, you know that if you build some component on top of it, you could reuse, say, something like gfs. or you could reuse some lock [unintelligible] or a lock manager.

and a lot of the existing software solutions, they're not going to necessarily have the same kind of benefit. and they haven't really worked with the number of machines or data centers, or at the scale, that we have. and can you point to one large scale system that works across many thousands of machines in a reliable, fault-[unintelligible] way and processes so much data? there's a lot of research in this area, but there's not any good commercially deployed solution that actually

currently works. audience: do you have any internal systems that share the knowledge on this? it seems like each application that you build in a different office or whatnot could benefit from the information that's learned in other offices. but how do you build special software that'll leverage technology transfer across the different groups? narayanan shivakumar: so the question was, how do we make

sure that different offices share information about how to build different applications? so we have a lot of r&d centers at this point. we have maybe 20, 25-- 16? audience: 16. narayanan shivakumar: i thought it was 20-ish. it's growing every single day. it's very hard to keep track of these things.

so there is usually a bootstrapping problem. as in, initially, when a team starts in a new location, there's a bunch of building up that needs to happen. how do you use the three things that i talked about? and there's a lot of internal knowledge about how that actually happens. but there's no substitute for having done it before. so typically what happens is, in any of the r&d centers that we have, there are some people who move from, say, from

mountain view, or from here, or from new york, or wherever, to start seeding these offices. so there's some initial knowledge to start groups with. and then over time, you have enough [unintelligible] built up, because you know what are the [unintelligible] that are going on in bigtable, or what are the [unintelligible]. audience: yeah, i'm curious if a bigtable [inaudible] can span data centers.

and if so, what are some of the big challenges there? narayanan shivakumar: so the question is, do the bigtable cells span multiple data centers? i can't comment on exactly how it works across multiple data centers, but it's a very hard problem. and so that's one of the things that google will keep changing as we keep growing in size. you can imagine how it works very well when you have small amounts of data.

but if you start thinking about a petabyte of data thrown into a bigtable, that has to work across multiple data centers, and will have to be reliable up to the second in terms of coherency? that's really hard. probably even impossible in some cases. so again, it gets back to the list of applications. and perhaps you can get away with copying some data relationships.

these are hard problems, right? so the applications will pick and choose between the technology that they have, and will build up things that they need to build up into their [inaudible]. i can take two more questions. audience: can you talk a little bit about qa? what do you do to make sure a product is scalable before it goes into production? narayanan shivakumar: the question is about qa.

what do we do, in terms of [unintelligible], before going into production? testing is a huge component. and engineers have a lot of unit tests, regression tests, and this gets into typical engineering [inaudible]. and as you might imagine, every company that works in this space has a canary testing set-up, where they will push out a bunch of binaries and try to see if it actually works, if it has memory leaks.

if it doesn't do the right things, then you have lots of monitoring built in. and once it actually works, then you push it out to production. and there are glitches. and this is partly why it's interesting as well. because we are happier about making quick progress by doing a bunch of changes and pushing them out into production, rather than waiting for a month to push out changes.
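the canary idea is simple enough to sketch: push the new binary to a small slice of machines, watch the monitoring signals, and only then roll out wider. the fraction, the health check, and all the names below are invented for illustration; real release systems are much more involved.

```python
def canary_rollout(machines, new_binary, check_health, canary_fraction=0.01):
    """deploy to a small canary set first; abort and roll back if it looks bad."""
    canary_count = max(1, int(len(machines) * canary_fraction))
    canary, rest = machines[:canary_count], machines[canary_count:]

    for m in canary:
        m.deploy(new_binary)

    # check_health is assumed to consult monitoring on the canary machines:
    # error rates, latency, memory growth, and so on
    if not check_health(canary):
        for m in canary:
            m.rollback()
        return False

    for m in rest:
        m.deploy(new_binary)
    return True
```

the point is just the ordering: a small, monitored exposure first, and a cheap path back if the new binary misbehaves, which is what makes frequent pushes safe.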

because as long as we can fix it quickly enough, we are in better shape. in fact, it used to be a badge of honor-- people who brought down google.com. and it's kind of nice to have that badge. one last question. audience: i'm just wondering about redundancy, like, do you have any measures, or any systems, for actually having permanent storage for your data?

for example, if gfs [unintelligible]. do you have a more permanent [unintelligible] storage, or something like that? narayanan shivakumar: the question is about [unintelligible] storage versus disks. one of the nice parts about gfs, as i said earlier, is that we haven't lost much data from it at all. since the beginning of time, we've lost 64 megabytes, or a chunk, or whatever.

that's not too bad. but as you might imagine, just for disaster recovery, every company has to turn to policies about what data you want to maintain, and what data you want to keep around. and what is the cost for that? it'll have to be done on a per-application basis. but again, the goal that we have is always to build up the different components that will work across a large and varied set of applications.

and then the applications, depending on how important they are, can pick and choose in terms of price, in terms of how expensive it is to maintain, and so on. all right, well, thank you for coming and [unintelligible]
