BigDaddy Means Big Changes at Google
Jan 25, 2006 11:19 AM , By Brian Quinton
After a few months, the engines make another change, and it’s off to the races again.
Well, optimizers on Google are lacing up their running shoes for another race. Only this one promises to be more of a marathon than the usual sprint. Google is testing a new data center infrastructure, a feat much bigger and more comprehensive than an algorithm change. Dubbed “BigDaddy” both in the search marketing blogs and forums and by the friendly folks at Google, this new data center, still in shakedown mode, will reportedly add new ground-level capabilities to the Google search function and drive those powers deep into all the algorithms with which Google searches, studies and indexes the Web.
First, a bit of big-picture talk. Google’s examination of the Web relies on a global network of data centers with different IP addresses. These decentralized servers speed the job of sending specialized Google services to users in different regions; they also share the workload of spidering the Web and comparing those discoveries to Web pages that are already in Google’s index.
But what is BigDaddy intended to do? According to Rob Sullivan, head organic search strategist at search marketing firm Enquiro, “If an algorithm update is like putting new tires on a car or installing a new stereo system, this BigDaddy is like putting in a whole new motor. They’re totally revamping how Google works and resolving some long-standing issues with getting sites indexed properly.”
One of those issues is “canonicalization”. That’s a fancy Google word for instructing a search engine how to decide which of a series of related URLs is the proper one to insert into the Google index. Say your Web site has a number of different home page URLs, including “stuff.com”, “www.stuff.com”, “www.stuff.com/index.html” and “stuff.com/home.asp”. This can come about because Web servers are often set up to accept aliases for Web pages, and to know that a request for “stuff.com” means someone’s looking for “www.stuff.com”. That’s a concession to users who get tired of getting error messages when they don’t type in “www”.
The problem is that while these URLs may pull up the same page content, they’re technically four different pages. That could skew the page count Google gets for the Web site, so that a site with 1,000 pages, each reachable under two URLs, might look twice its real size to Google.
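To make the idea concrete, here is a minimal sketch in Python of what collapsing aliases onto one representative URL might look like. It is an illustration only, not Google’s method, which has not been disclosed. The function name canonical_url, the list of default-document names and the rule of always preferring the “www” host are all assumptions made for this example.

```python
from urllib.parse import urlsplit

# Hypothetical set of file names treated as aliases of a directory's root page.
DEFAULT_DOCUMENTS = {"index.html", "index.htm", "home.asp"}


def canonical_url(url: str) -> str:
    """Map alias URLs for the same page onto one representative URL.

    Assumed rules (for illustration only): lower-case the host, always
    prefer the "www." form, and drop a trailing default-document name.
    """
    # Add a scheme when it is missing so urlsplit can locate the host.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)

    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host

    # Replace a trailing default document such as "index.html" with "".
    segments = parts.path.split("/")
    if segments[-1].lower() in DEFAULT_DOCUMENTS:
        segments[-1] = ""
    path = "/".join(segments) or "/"

    return f"http://{host}{path}"


# The four home page aliases mentioned in the article.
aliases = [
    "stuff.com",
    "www.stuff.com",
    "www.stuff.com/index.html",
    "stuff.com/home.asp",
]
canonical = {canonical_url(u) for u in aliases}
print(len(aliases), "URLs crawled,", len(canonical), "distinct page(s)")
```

Under these assumed rules, all four home page aliases reduce to the single address http://www.stuff.com/, so a crawler counting canonical URLs would see one page rather than four. A real search engine would also have to confirm that the aliases actually serve the same content before merging them.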
Finally, a Google search that turns up multiple entries for what is essentially the same content makes the results page that much less valuable to users. Better to select one of the URLs as the most representative and make room for other results.
“If you want to go to the Seattle Seahawks page on the NFL Web site, you’ll get this long, horrendous URL,” Sullivan says. “But the site also has another URL that’s just ‘Seattle Seahawks’. It pulls the content from the first page and just displays it under a prettier URL. So Google wants to be able to say that second page is the one people really want, and they’ll attribute all the traffic, links and value to the shorter URL.”
“302 redirects are a big hole in the system,” Sullivan says. “People are using 302 redirects to hijack content and pages and many other things. By fixing this, Google will be eliminating a lot of problems.”
Of course, how BigDaddy will fix these issues is a closely held secret. As with many other questions surrounding the compiling and ranking of its index, Google refuses to be specific for fear that too much information will only teach the bad guys how to get around the system.
Some of these changes will bring Google’s indexing technology up to par with its competitors; for example, Yahoo! and MSN have been handling 302 redirects for a year or more, although perhaps not as effectively as BigDaddy will eventually do. But other aspects of BigDaddy will help position Google to measure up to the search requirements of the future in some interesting ways, Sullivan says.
“This will lay the groundwork for more advanced algorithms, larger databases, and being able to index different types of content more effectively,” he says. For example, Google has also begun using a search crawler built on a Mozilla browser. The new search bot is more flexible, seems faster and can read non-text content more readily; that should mean that in time, it will be able to read links within images and even within Flash video, content that gets ignored by bots that can’t speak JavaScript.
“As Web technology develops and we get richer and more interactive Web sites, [the search engines] can’t just stick with just indexing hyperlinks and text,” Sullivan says. “They’re going to have to do everything.”