In this video, Matt explains how PageRank is used, along with crawling timelines, frequencies, priorities, and the indexing and filtering processes within the databases.
Here is the transcript rendered by YouTube:
0:00 MATT CUTTS: Hi, everybody.
0:01 We got a really interesting and very expansive question
0:04 from RobertvH in Munich.
0:06 RobertvH wants to know–
0:09 Hi Matt, could you please explain how Google’s ranking
0:12 and website evaluation process works starting with the
0:14 crawling and analysis of a site, crawling timelines,
0:18 frequencies, priorities, indexing and filtering
0:21 processes within the databases, et cetera?
0:25 OK.
0:25 So that’s basically just like, tell me
0:27 everything about Google.
0:28 Right?
0:29 That’s a really expansive question.
0:30 It covers a lot of different ground.
0:32 And in fact, I have given orientation lectures to
0:35 engineers when they come in.
0:37 And I can talk for an hour about all those different
0:40 topics, and even talk for an hour about a very small subset
0:43 of those topics.
0:45 So let me talk for a while and see how much of a feel I can
0:48 give you for how the Google infrastructure works, how it
0:51 all fits together, how our crawling and indexing and
0:53 serving pipeline works.
0:55 Let’s dive right in.
0:57 So there’s three things that you really want to do well if
0:59 you want to be the world’s best search engine.
1:01 You want to crawl the web comprehensively and deeply.
1:03 You want to index those pages.
1:05 And then you want to rank or serve those pages and return
1:08 the most relevant ones first.
1:10 Crawling is actually more difficult
1:11 than you might think.
1:13 Whenever Google started, whenever I joined back in
1:16 2000, we didn’t manage to crawl the web for something
1:18 like three or four months.
1:20 And we had to have a war room.
1:22 But a good way to think about the mental model is we
1:25 basically take PageRank as the primary determinant.
1:28 And the more PageRank you have– that is, the more
1:31 people who link to you and the more reputable those people
1:34 are– the more likely it is we’re going to discover your
1:37 page relatively early in the crawl.
1:39 In fact, you could imagine crawling in strict PageRank
1:41 order, and you’d get the CNNs of the world and The New York
1:45 Times of the world and really very high PageRank sites.
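
A minimal sketch of the "strict PageRank order" idea described above, assuming each URL already has a precomputed score. The function name `crawl_order`, the example hostnames, and the scores are hypothetical, and this is only an illustration of priority ordering, not Google's actual crawler.

```python
import heapq


def crawl_order(pagerank):
    """Return URLs in descending order of a precomputed score."""
    # heapq is a min-heap, so negate the score to pop the highest first.
    heap = [(-score, url) for url, score in pagerank.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order


# Hypothetical scores, for illustration only.
scores = {"smallblog.example": 0.1, "cnn.example": 0.9, "nytimes.example": 0.8}
order = crawl_order(scores)
print(order)
```

The highest-scoring sites come out of the queue first, which is why well-linked pages tend to be discovered early in a crawl.
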
1:49 And if you think about how things used to be, we used to
1:51 crawl for 30 days.
1:53 So we’d crawl for several weeks.
1:56 And then we would index for about a week.
1:59 And then we would push that data out.
2:01 And that would take about a week.
2:04 And so that was what the Google dance was.
2:05 Sometimes you’d hit one data center that had old data.
2:07 And sometimes you’d hit a data center that had new data.
2:10 Now there’s various interesting tricks
2:13 that you can do.
2:13 For example, after you’ve crawled for 30 days, you can
2:16 imagine recrawling the high PageRank guys so you can see
2:19 if there’s anything new or important that’s hit on the
2:21 CNN home page.
2:22 But for the most part, this is not fantastic.
2:25 Right?
2:25 Because if you’re trying to crawl the web and it takes you
2:28 30 days, you’re going to be out-of-date.
2:30 So eventually, in 2003, I believe, we switched as part
2:36 of an update called Update Fritz to crawling a fairly
2:40 interesting significant chunk of the web every day.
2:43 And so if you imagine breaking the web into a certain number
2:47 of segments, you could imagine crawling that part of the web
2:51 and refreshing it every night.
2:53 And so at any given point, your main base index would
2:58 only be so out of date.
3:00 Because then you’d loop back around and you’d refresh that.
3:03 And that works very, very well.
3:04 Instead of waiting for everything to finish, you’re
3:06 incrementally updating your index.
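
A rough sketch of the segment-by-segment refresh described above, assuming the web is split into a fixed number of segments and exactly one segment is recrawled per night in round-robin order. The segment count and the function name `simulate_refresh` are assumptions made for illustration; the transcript does not say how many segments were used.

```python
def simulate_refresh(num_segments, num_days):
    """Recrawl one segment per day and return the worst staleness seen."""
    last_crawled = {segment: None for segment in range(num_segments)}
    worst_age = 0
    for day in range(num_days):
        # Round-robin: loop back to segment 0 after the last segment.
        segment = day % num_segments
        last_crawled[segment] = day
        # Staleness in days of every segment crawled at least once.
        ages = [day - d for d in last_crawled.values() if d is not None]
        worst_age = max(worst_age, max(ages))
    return worst_age


print(simulate_refresh(num_segments=7, num_days=30))
```

With seven segments, no crawled segment is ever more than six days old, which is the sense in which the index "would only be so out of date".
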
3:08 And we’ve gotten even better over time.
3:10 So at this point, we can get very, very fresh.
3:14 Any time we see updates, we can usually
3:16 find them very quickly.
3:18 And in the old days, you would have not just a main or a base
3:20 index, but you could have what were called supplemental
3:24 results, or the supplemental index.
3:26 And that was something that we wouldn’t crawl and refresh
3:28 quite as often.
3:29 But it was a lot more documents.
3:31 And so you could almost imagine having really fresh
3:35 content, a layer of our main index, and then more documents
3:40 that are not refreshed quite as often, but there’s a lot
3:42 more of them.
3:43 So that’s just a little bit about the crawl and how to
3:45 crawl comprehensively.
3:47 What you do then is you pass things around.
3:49 And you basically say, OK, I have crawled a large fraction
3:53 of the web.
3:54 And within that web you have, for example, one document.
3:58 And indexing is basically taking things in word order.
4:04 Well, let’s just work through an example.
4:06 Suppose you say Katy Perry.
4:10 In a document, Katy Perry appears right
4:13 next to each other.
4:14 But what you want in an index is which documents does the
4:18 word Katy appear in, and which documents does the word
4:20 Perry appear in?
4:22 So you might say Katy appears in documents 1, and 2, and 89,
4:26 and 555, and 789.
4:32 And Perry might appear in documents number 2, and 8, and
4:37 73, and 555, and 1,000.
4:42 And so the whole process of doing the index is reversing,
4:47 so that instead of having the documents in word order, you
4:50 have the words, and they have it in document order.
4:53 So it’s, OK, these are all the documents that a
4:54 word appears in.
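
The reversal described above is usually called an inverted index. Here is a minimal sketch, assuming documents are plain strings split on whitespace; the tiny example documents and the name `build_inverted_index` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    # Visit documents in ID order so every posting list ends up sorted.
    for doc_id in sorted(docs):
        # set() so a word repeated in one document is recorded once.
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return dict(index)


# Hypothetical documents, keyed by document ID.
docs = {1: "Katy sings", 2: "Katy Perry tour", 8: "Perry Mason"}
index = build_inverted_index(docs)
print(index["katy"], index["perry"])
```
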
4:56 Now when someone comes to Google and they type in Katy
4:59 Perry, you want to say, OK, what documents might match
5:02 Katy Perry?
5:03 Well, document one has Katy, but it doesn’t have Perry.
5:06 So it’s out.
5:08 Document number two has both Katy and Perry, so that’s a
5:11 possibility.
5:12 Document eight has Perry but not Katy.
5:15 89 and 73 are out because they don’t have the right
5:18 combination of words.
5:19 555 has both Katy and Perry.
5:22 And then these two are also out.
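
The walk-through above amounts to intersecting two sorted lists of document IDs. A minimal sketch using the document numbers from the example, assuming both lists are sorted in ascending order (the function name `intersect` is an assumption):

```python
def intersect(a, b):
    """Return the IDs present in both sorted lists, in ascending order."""
    i = j = 0
    matches = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return matches


katy = [1, 2, 89, 555, 789]
perry = [2, 8, 73, 555, 1000]
print(intersect(katy, perry))  # [2, 555]
```

Only documents 2 and 555 survive, matching the result worked out by hand.
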
5:25 And so when someone comes to Google and they type in
5:27 Chicken Little, Britney Spears, Matt Cutts, Katy
5:29 Perry, whatever it is, we find the documents that we believe
5:32 have those words, either on the page or maybe in
5:35 backlinks, in anchor text pointing to that document.
5:38 Once you’ve done what’s called document selection, you try to
5:41 figure out, how should you rank those?
5:43 And that’s really tricky.
5:44 We use PageRank as well as over 200 other factors in our
5:49 rankings to try to say, OK, maybe this document is really
5:52 authoritative.
5:53 It has a lot of reputation because it has
5:55 a lot of PageRank.
5:56 But it only has the word Perry once.
5:58 And it just happens to have the word Katy somewhere else
6:01 on the page.
6:02 Whereas here is a document that has the word Katy and
6:04 Perry right next to each other, so there’s proximity.
6:07 And it’s got a lot of reputation.
6:09 It’s got a lot of links pointing to it.
6:12 So we try to balance that off.
6:13 You want to find reputable documents that are also about
6:16 what the user typed in.
6:18 And that’s kind of the secret sauce, trying to figure out a
6:20 way to combine those 200 different ranking signals in
6:23 order to find the most relevant document.
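
As a toy illustration of balancing reputation against proximity, here is a sketch that combines just two signals with a weighted sum. The weights, the reputation values, and the scoring formula are all invented for this example; how Google actually combines its signals is not described in the video.

```python
def min_gap(positions_a, positions_b):
    """Smallest distance, in words, between any occurrence of the two terms."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def score(reputation, positions_a, positions_b, w_rep=0.5, w_prox=0.5):
    """Weighted sum of a reputation signal and a proximity signal."""
    # Proximity is 1.0 when the words are adjacent and shrinks as they spread.
    proximity = 1.0 / max(min_gap(positions_a, positions_b), 1)
    return w_rep * reputation + w_prox * proximity


# Hypothetical document A: very reputable, but the two words are far apart.
score_a = score(reputation=0.9, positions_a=[3], positions_b=[40])
# Hypothetical document B: somewhat less reputable, words are adjacent.
score_b = score(reputation=0.7, positions_a=[5], positions_b=[6])
print(score_a, score_b)
```

With these made-up weights, document B outranks document A because the adjacent words outweigh the small gap in reputation.
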
6:25 So at any given time, hundreds of millions of times a day,
6:30 someone comes to Google.
6:32 We try to find the closest data center to them.
6:34 They type in something like Katy Perry.
6:36 We send that query out to hundreds of different machines
6:38 all at once, which look through their little tiny
6:41 fraction of the web that we’ve indexed.
6:43 And we find, OK, these are the documents that
6:45 we think best match.
6:47 All those machines return their matches.
6:49 And we say, OK, what’s the creme de la creme?
6:52 What’s the needle in the haystack?
6:53 What’s the best page that matches this query across our
6:56 entire index?
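
The fan-out described above is often called scatter-gather. Below is a small sketch, assuming each machine (here a "shard") holds a dictionary from document ID to a set of words and a precomputed score. The shard layout, the scores, and the function names are assumptions for illustration, and threads stand in for separate machines.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_words, k):
    """Return the top-k (score, doc_id) hits from one shard."""
    hits = [
        (doc_score, doc_id)
        for doc_id, (words, doc_score) in shard.items()
        if query_words <= words  # every query word must be in the document
    ]
    return heapq.nlargest(k, hits)


def search(shards, query_words, k=3):
    """Query all shards in parallel, then merge their partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, query_words, k), shards)
        all_hits = [hit for partial in partials for hit in partial]
    return [doc_id for _, doc_id in heapq.nlargest(k, all_hits)]


# Two hypothetical shards, each holding a small slice of the index.
shards = [
    {1: ({"katy"}, 0.9), 2: ({"katy", "perry"}, 0.4)},
    {8: ({"perry"}, 0.7), 555: ({"katy", "perry"}, 0.8)},
]
results = search(shards, {"katy", "perry"})
print(results)  # [555, 2]
```

Each shard only needs to return its own best few matches, so the merge step stays cheap even when there are many shards.
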
6:57 And then we take that page and we try to show it with a
7:00 useful snippet.
7:01 So you show the keywords in the context of the document.
7:03 And you get it all back in under half a second.
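
A very simple sketch of showing keywords in context, assuming the snippet is just a fixed window of words around the first matching query word. The function name `snippet`, the `window` parameter, and the sample text are hypothetical; real snippet generation is considerably more involved.

```python
def snippet(text, query_words, window=3):
    """Return a few words of context around the first query-word match."""
    words = text.split()
    wanted = {w.lower() for w in query_words}
    for i, word in enumerate(words):
        # Ignore trailing punctuation when comparing against the query.
        if word.lower().strip(".,!?") in wanted:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            return " ".join(words[start:end])
    # No match: fall back to the opening words of the document.
    return " ".join(words[: 2 * window + 1])


text = "The tour dates for Katy Perry were announced on Monday morning"
result = snippet(text, ["katy", "perry"], window=2)
print(result)  # dates for Katy Perry were
```
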
7:06 So that’s probably about as long as we can go on without
7:10 straining YouTube.
7:11 But that just gives you a little bit of a feel about how
7:13 the crawling system works, how we index documents, how things
7:16 get returned in under half a second through that massive
7:19 parallelization.
7:20 I hope that helps.
7:21 And if you want to know more, there’s a whole bunch of
7:23 articles and academic papers about Google, and PageRank,
7:26 and how Google works.
7:28 But you can also apply to–
7:30 there’s jobs(at)google.com, I think, or google.com/jobs, if
7:34 you’re interested in learning a lot more about how search
7:36 engines work.
7:37 OK.
7:37 Thanks very much.
7:39