To offer the best possible results, search engines must attempt to discover all the public pages on the World Wide Web and then present the ones that best match the user’s search query. The first step in this process is crawling the web. The search engines start with a seed set of sites known to be very high quality, and then visit the links on each page of those sites to discover other web pages. The link structure of the web binds together all of the pages that were made public as a result of someone linking to them. Through links, search engines’ automated robots, called crawlers or spiders, can reach the many billions of interconnected documents.
The search engine then loads those pages and analyzes their content as well. This process repeats over and over again until the crawl is complete. It is an enormously complex undertaking, because the web itself is vast.
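The discovery loop described above is essentially a breadth-first traversal of the link graph. A minimal sketch follows, with the web modeled as an in-memory dictionary (all URLs are invented for illustration); a real crawler would fetch pages over HTTP, parse links out of the HTML, and respect robots.txt, but the traversal logic is the same.

```python
from collections import deque

# Toy "web": page URL -> list of URLs that page links to. These URLs are
# hypothetical; a real crawler would fetch each page and extract its links.
WEB = {
    "https://seed-a.example/": ["https://site-b.example/", "https://site-c.example/"],
    "https://site-b.example/": ["https://site-c.example/", "https://site-d.example/"],
    "https://site-c.example/": ["https://seed-a.example/"],
    "https://site-d.example/": [],
}

def crawl(seeds):
    """Breadth-first crawl starting from a trusted seed set."""
    frontier = deque(seeds)      # pages queued to be crawled
    discovered = set(seeds)      # pages we have become aware of
    order = []                   # order in which pages were crawled
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in WEB.get(url, []):    # read the links on the page
            if link not in discovered:   # queue pages seen for the first time
                discovered.add(link)
                frontier.append(link)
    return order
```

Starting from `seed-a.example` alone, the crawl reaches all four pages, even though only two of them are linked directly from the seed.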
Search engines do not attempt to crawl the entire web every day. In fact, they may become aware of pages that they choose not to crawl at all, because those pages are unlikely to be important enough to return in a search result.
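This selectivity can be pictured as a prioritized crawl frontier: every known URL carries an importance score, high-scoring pages are crawled first, and pages below some cutoff are recorded but never fetched. The sketch below uses a hand-written score table purely for illustration; in a real engine, scores would come from link analysis and other signals, and the threshold here is an invented value.

```python
import heapq

# Hypothetical importance scores; real engines derive these from signals
# such as inbound links, not from a hand-written table.
SCORES = {
    "https://hub.example/": 0.9,
    "https://news.example/": 0.7,
    "https://tracker.example/?session=123": 0.05,
}

CRAWL_THRESHOLD = 0.1  # assumed cutoff: pages below this are known but not crawled

def select_for_crawl(urls):
    """Return (crawl_order, skipped): highest-priority pages first, with
    low-value pages recorded but never fetched."""
    heap = [(-SCORES.get(u, 0.0), u) for u in urls]  # max-heap via negated scores
    heapq.heapify(heap)
    crawl_order, skipped = [], []
    while heap:
        neg_score, url = heapq.heappop(heap)
        if -neg_score >= CRAWL_THRESHOLD:
            crawl_order.append(url)
        else:
            skipped.append(url)  # the engine is aware of this page but skips it
    return crawl_order, skipped
```

With the scores above, the hub and news pages are queued for crawling in priority order, while the low-value tracker URL is known to the engine but left unfetched.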
The next step in this process is to build an index of terms. This is a massive database that catalogs all the significant terms on each page crawled by the search engine. A great deal of other data is recorded as well, such as a map of all the pages that each page links to, the clickable text of those links (we call this the anchor text), whether or not those links are considered ads, and more. To accomplish the monumental task of holding data on hundreds of billions (or trillions) of pages that can be accessed in a fraction of a second, the search engines have constructed massive data centers to deal with
all this data. One key concept in building a search engine is deciding where to begin a crawl of the web. Although you could theoretically start from many different places on the web, you would ideally begin your crawl with a trusted seed set of websites. Starting with a known, trusted set of websites enables search engines to measure how much they trust the other websites that they find through the crawling process.
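The index of terms described above is conventionally an inverted index: a mapping from each term to the set of pages containing it. The sketch below also credits anchor text to the page the link points at, since that text describes the link target. The pages and text are invented, and real indexes store far richer data (positions, ad flags, and so on).

```python
from collections import defaultdict

# Toy corpus: URL -> (page text, list of (target_url, anchor_text) links).
# All content here is hypothetical.
PAGES = {
    "https://a.example/": ("fresh coffee beans", [("https://b.example/", "roasting guide")]),
    "https://b.example/": ("how to roast coffee", []),
}

def build_index(pages):
    """Build a term -> set-of-URLs inverted index, crediting anchor text
    to the page it points at."""
    index = defaultdict(set)
    for url, (text, links) in pages.items():
        for term in text.lower().split():
            index[term].add(url)          # a page is indexed under its own terms
        for target, anchor in links:
            for term in anchor.lower().split():
                index[term].add(target)   # anchor terms count for the link target
    return index

index = build_index(PAGES)
# "roasting" never appears on b.example's page, yet b.example is indexed
# under it because a.example's link anchor says "roasting guide".
```

This is why anchor text matters: a page can rank for terms that other pages use when linking to it.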