Peeking Into Google
By Susan Kuchinskas
March 2, 2005

BURLINGAME, Calif. -- The key to the speed and reliability of Google search is cutting up data into chunks, its top engineer said.

Urs Hoelzle, Google vice president of operations and vice president of engineering, offered a rare behind-the-scenes tour of Google's architecture on Wednesday. Hoelzle spoke here at EclipseCon 2005, a conference on the open source, extensible platform for software tools.

To deal with the more than 10 billion Web pages and tens of terabytes of information on Google's servers, the company combines cheap machines with plenty of redundancy, Hoelzle said. Its commodity servers cost around $1,000 apiece, and Google's architecture places them into interconnected nodes.

All machines run a stripped-down Linux kernel. The distribution is Red Hat, but Hoelzle said Google doesn't use much of the distro, and it has written its own patches for things that haven't been fixed in the upstream kernel.

"The downside to cheap machines is, you have to make them work together reliably," Hoelzle said. "These things are cheap and easy to put together. The problem is, these things break."

In fact, at Google's scale, many machines fail every day. So Google has automated methods of dealing with machine failures, allowing it to build a fast, highly reliable service on cheap hardware.

Google replicates the Web pages it caches by splitting them up into pieces it calls "shards." The shards are small enough that several can fit on one machine. And they're replicated on several machines, so that if one breaks, another can serve up the information. The master index is also split up among several servers, and that set also is replicated several times. The engineers call these "chunk servers."

As a search query comes into the system, it hits a Web server and is then fanned out across the index shards. One set of index servers holds one complete copy of the index, so answering a query requires one full set of servers. Since each set is replicated as a fail-safe, the replication also increases throughput: if one set is busy, a new query can be routed to the next set, which drives down search time per box.
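
To make that routing idea concrete, here is a rough sketch in Python -- not Google's code, just a toy under the assumptions above: the index is split into shards, each shard is served by several replica machines, and a query must touch one replica of every shard.

import random

# Toy dispatch over sharded, replicated index servers (hypothetical classes).

class IndexReplica:
    def __init__(self, name):
        self.name = name
        self.busy = False

    def search(self, query):
        # Placeholder: a real replica would return scored hits for its shard.
        return [f"{self.name}: hit for '{query}'"]

def route_query(query, shards):
    """shards is a list of replica lists, one inner list per index shard.
    Answering a query requires one replica of every shard."""
    partial_results = []
    for replicas in shards:
        idle = [r for r in replicas if not r.busy]
        chosen = random.choice(idle or replicas)  # prefer an idle replica
        partial_results.extend(chosen.search(query))
    return partial_results  # the front end would merge and rank these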

In parallel, clusters of document servers contain copies of Web pages that Google has cached. Hoelzle said that the refresh rate is from one to seven days, with an average of two days. That's mostly dependent on the needs of the Web publishers.

"One surprising limitation is we can't crawl as fast as we would like, because [smaller] webmasters complain," he said.

Each set of document servers contains one copy of the Web. These machines are responsible for delivering the content snippets that show searchers relevant text from the page.

"When we have your top 10 results, they get sent to the document servers, which load the 10 result pages into memory," Hoelzle said. "Then you parse through them and find the best snippet that contains all the query words."

Google uses three software systems built in-house to route queries, balance server loads and make programming easier.

The Google File System was written specifically to deal with the cheap machines that will fail.

"We take our files and chunk them up, then you randomly distribute the chunks across different machines, making sure each chunk has at least two copies that are not physically adjacent -- not on same power strip or same switch," Hoelzle said. "We try to make sure that even if one copy goes away, another copy is still here." Chunks typically are 64 megabytes and are replicated three times.

All this replication makes it easier to make changes, Hoelzle said. Google simply takes one replica at a time offline, updates it, then plugs the machines back in.

Because these chunks are randomly distributed all over, Google needs a master containing metadata to keep track of where the chunks are. When a query comes into the system, the file system master tells it which chunk server has the data. "From there on, you just talk to the chunk servers," he said.

Client machines are responsible for fault tolerance. If a client requests a file from the specified chunk server and gets no response within the designated time period, it uses the master's metadata to locate another chunk server holding a replica, while sending the master a hint that the first chunk server might have died. If the master confirms that the server has gone down, it replicates the chunks that were stored on it to other servers, making sure the information is kept at least the minimum number of times.
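
That client-side read path could be sketched roughly as follows; master, server and their methods are hypothetical stand-ins for illustration, not the real GFS API.

import socket

READ_TIMEOUT = 2.0  # seconds before the client gives up on a chunk server

def read_chunk(master, chunk_id):
    """Read one chunk with the fault handling described above."""
    replicas = master.locate(chunk_id)   # metadata: which servers hold this chunk
    for server in replicas:
        try:
            return server.read(chunk_id, timeout=READ_TIMEOUT)
        except (socket.timeout, ConnectionError):
            # Hint that this server may be dead; if the master confirms it,
            # it re-replicates the chunks that lived there elsewhere.
            master.report_suspect_failure(server, chunk_id)
    raise IOError(f"all replicas of chunk {chunk_id} are unavailable")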

"You were vulnerable for only a very brief period," he said.

To enable Google programmers to write applications to run in parallel on 1,000 machines, engineers created the Map/Reduce Framework in 2004.

"The Map/Reduce Framework provides automatic and efficient parallelization and distribution," Hoelzle said. "It's fault tolerant and it does the I/O scheduling, being a little bit smart about where the data lives."

Programmers write two simple functions, map and reduce. The map function works through a long list of input key/value pairs and produces intermediate key/value pairs. "You just map one pair to another pair," he said.

For example, if an application needs to count the URLs on each host, the programmer would take each URL and its contents and map them to the pair consisting of the hostname and 1. "This produces an intermediate set of key/value pairs with different values."

Next, a reduction operation takes all the outputs that have the same key and combines them to produce a single output.

"Map/Reduce is simplified large-scale data processing," Hoelzle said, "a very simple abstraction that makes it possible to write programs that run over these terabytes of data with little effort."

The third homegrown application is Google's Global Work Queue, which is for scheduling.

Global Work Queue works like old-time batch processing. It schedules computations as batch jobs and places them on pools of machines. The setup is optimized for running arbitrary computations over tons of data.

"Mostly, you want to split the huge task into lots of small chunks, which provides even load balancing across machines," Hoelzle said. The idea is to have more tasks than machines so machines are never idle.

Hoelzle also demonstrated how Google uses its massive architecture to learn from data. It analyzes the most common misspellings of queries and uses that information to power the feature that suggests alternate spellings.

The company also is applying machine learning to its system to give better results. Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cuisine" is a good match even though it contains none of the query words.

To do this, the system tries to cluster concepts into "reasonably coherent" subclusters that seem related. These clusters, some tiny and some huge, are named automatically. Then, when a query comes in, the system produces a probability score for each cluster. This kind of machine learning has had little success in academic trials, Hoelzle said, because researchers didn't have enough data. "If you have enough data, you get reasonably good answers out of it."
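
A toy version of just the scoring step might look like the following; the per-cluster word weights here are hypothetical, whereas in Google's case they would be learned offline from its enormous query and document data.

def cluster_scores(query_words, clusters):
    """clusters maps cluster name -> {word: weight} (hypothetical, learned
    offline). Returns a probability-like score for each cluster."""
    words = [w.lower() for w in query_words]
    raw = {name: sum(weights.get(w, 0.0) for w in words)
           for name, weights in clusters.items()}
    total = sum(raw.values())
    if total == 0:
        return {name: 1.0 / len(clusters) for name in clusters}  # uninformative
    return {name: score / total for name, score in raw.items()}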

In addition to improving query results, Google uses this learning to better deliver contextual ads for its AdSense service to Web publishers, as well as to more accurately cluster news stories within Google News.

Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.

"You don't have just one data center," he said, "you have multiples."

