Understanding how search engines work: Googlebot and MSNBot

Message # 1 | 6:40 PM 2011-01-01

Googlebot

Quote

What is Googlebot?

Googlebot is Google's web crawling bot (sometimes also called a "spider"). Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index.

We use a huge set of computers to fetch (or "crawl") billions of pages on the web. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.

Googlebot's crawl process begins with a list of webpage URLs, generated from previous crawl processes and augmented with Sitemap data provided by webmasters. As Googlebot visits each of these websites it detects links (SRC and HREF) on each page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.
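To make the "detect links and add them to the list" step more concrete, here is a minimal sketch in Python. It is only an illustration and not Google's actual code: the names LinkCollector, extract_links and extend_frontier are made up for this example, and the sample page is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of HREF and SRC attributes found on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names, so HREF and href both match.
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every HREF/SRC target on the page as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def extend_frontier(frontier, seen, new_links):
    """Queue links that have not been queued before (hypothetical crawl list)."""
    for link in new_links:
        if link not in seen:
            seen.add(link)
            frontier.append(link)


# Invented sample page: two distinct links, one of them repeated.
page = '<a href="/about.html">About</a><img src="logo.png"><a href="/about.html">Again</a>'
frontier, seen = [], set()
extend_frontier(frontier, seen, extract_links("http://www.myhost.com/index.html", page))
print(frontier)
```

The repeated link is queued only once, which is the reason for keeping a separate set of URLs that have already been seen.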
For webmasters: Googlebot and your site

How Googlebot accesses your site

For most sites, Googlebot shouldn't access your site more than once every few seconds on average. However, due to network delays, it's possible that the rate will appear to be slightly higher over short periods. In general, Googlebot should download only one copy of each page at a time. If you see that Googlebot is downloading a page multiple times, it's probably because the crawler was stopped and restarted.

Googlebot was designed to be distributed on several machines to improve performance and scale as the web grows. Also, to cut down on bandwidth usage, we run many crawlers on machines located near the sites they're indexing in the network. Therefore, your logs may show visits from several machines at google.com, all with the user-agent Googlebot. Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server's bandwidth. If needed, you can request a change in the crawl rate.

Blocking Googlebot from content on your site

It's almost impossible to keep a web server secret by not publishing links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. Similarly, the web has many outdated and broken links. Whenever someone publishes an incorrect link to your site or fails to update links to reflect changes in your server, Googlebot will try to download an incorrect link from your site.

If you want to prevent Googlebot from crawling content on your site, you have a number of options, including using robots.txt to block access to files and directories on your server.

Once you've created your robots.txt file, there may be a small delay before Googlebot discovers your changes. If Googlebot is still crawling content you've blocked in robots.txt, check that the robots.txt is in the correct location. It must be in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in a subdirectory won't have any effect.
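If you want to check your rules locally, Python's standard urllib.robotparser module can read a robots.txt and tell you what a given user-agent may fetch. The sketch below is only an approximation of how a real crawler reads the file; the robots.txt contents and the /private/ and /tmp/ paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: one group for Googlebot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """True if the rules above let this user-agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("Googlebot", "http://www.myhost.com/private/page.html"))
print(allowed("Googlebot", "http://www.myhost.com/index.html"))
print(allowed("SomeOtherBot", "http://www.myhost.com/tmp/file.html"))
```

Note that a bot with its own group (Googlebot here) follows only that group, so the /tmp/ rule in the "*" group does not apply to it.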

If you just want to prevent the "file not found" error messages in your web server log, you can create an empty file named robots.txt. If you want to prevent Googlebot from following any links on a page of your site, you can use the nofollow meta tag. To prevent Googlebot from following an individual link, add the rel="nofollow" attribute to the link itself.
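Here is a rough sketch of how a crawler might honour the two nofollow forms just described. It is a simplified illustration, not Googlebot's real logic; the class name FollowableLinks and the function followable_links are hypothetical.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collects links a polite crawler could follow on one page."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="nofollow">
        self.candidates = []        # links without rel="nofollow"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            directives = [d.strip().lower() for d in (attrs.get("content") or "").split(",")]
            if "nofollow" in directives:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            rel_tokens = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel_tokens:
                self.candidates.append(attrs["href"])

    def followable(self):
        # The page-level meta tag overrides every individual link.
        return [] if self.page_nofollow else self.candidates


def followable_links(html):
    parser = FollowableLinks()
    parser.feed(html)
    return parser.followable()


print(followable_links('<a href="/a">a</a> <a rel="nofollow" href="/b">b</a>'))
print(followable_links('<meta name="robots" content="nofollow"><a href="/a">a</a>'))
```

The first call keeps only /a because /b carries rel="nofollow"; the second returns nothing because the meta tag covers the whole page.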

Here are some additional tips:

* Test that your robots.txt is working as expected. The Test robots.txt tool in Webmaster Tools lets you see exactly how Googlebot will interpret the contents of your robots.txt file. The Google user-agent is (appropriately enough) Googlebot.
* The Fetch as Googlebot tool in Webmaster Tools helps you understand exactly how your site appears to Googlebot. This can be very useful when troubleshooting problems with your site's content or discoverability in search results.

Making sure your site is crawlable

Googlebot discovers sites by following links from page to page. The Crawl errors page in Webmaster Tools lists any problems Googlebot found when crawling your site. We recommend reviewing these crawl errors regularly to identify any problems with your site.

If your robots.txt file is working as expected but your site isn't getting traffic, there may be other reasons why your content is not performing well in search.

Problems with spammers and other user-agents

The IP addresses used by Googlebot change from time to time. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot). You can verify that a bot accessing your server really is Googlebot by using a reverse DNS lookup.
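A reverse DNS check can be sketched in Python with the standard socket module. Two things here are assumptions that the quoted text does not spell out: that genuine crawler hostnames end in googlebot.com or google.com, and that you should also do a forward lookup to confirm the hostname maps back to the same IP. The function name is_verified_googlebot is made up for this example.

```python
import socket


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record at all
    # Assumption: real Googlebot hosts live under these two domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP seen in the server log.
    return ip in addresses
```

Usage would be something like is_verified_googlebot("66.249.66.1") with an IP taken from your own log. The two lookup parameters exist only so the function can be tried out with fake resolvers instead of real DNS queries.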

Googlebot and all respectable search engine bots will respect the directives in robots.txt, but some nogoodniks and spammers do not. You can report spam to Google.

Google has several other user-agents, including Feedfetcher (user-agent Feedfetcher-Google). Since Feedfetcher requests come from explicit action by human users who have added the feeds to their Google home page or to Google Reader, and not from automated crawlers, Feedfetcher does not follow robots.txt guidelines. You can prevent Feedfetcher from crawling your site by configuring your server to serve a 404, 410, or other error status message to user-agent Feedfetcher-Google. More information about Feedfetcher is available in Google's webmaster help.
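How you serve an error to one user-agent depends on your server. As one possible illustration, assuming a Python WSGI application (the names block_feedfetcher and site are hypothetical), a small wrapper could answer 404 to Feedfetcher-Google and pass every other request through:

```python
def block_feedfetcher(app):
    """WSGI wrapper: answer 404 for Feedfetcher-Google, pass everything else on."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if "Feedfetcher-Google" in user_agent:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not Found"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]


application = block_feedfetcher(site)
```

On a hosted platform where you cannot change the server code, the same idea would have to be done in the server's own configuration instead.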

MSNBot

Quote

MSNBot: some topics not covered above

Outbound Links tool

The Outbound Links tool provides you with a list of URLs leading to other websites that Bing found on your website.
Use the Outbound Links tool to:

* Find all outbound links pointing to malware

If you want to see a report from Bing containing a list of the outbound links on your website that point to detected malware, select the Show only outbound links to malware check box and then click Search. You can also use the option Filter outbound links by top-level domain, subdomain or subfolder to get a report of outbound links to malware filtered by specific domains, subdomains, websites, and subfolders. For information on how to clean malware from your website, see Remediate detected malware.

When you generate an outbound links to malware report, if there are any such links on your website, you'll see a listing for each suspected link. The link is disabled and flagged with a Malware warning, just as links to malware appear in the Bing results page. To see a detailed report containing all of the links on your website pointing to the malware link, click What links here? next to the Malware warning flag. You'll download a comma separated values (CSV) file containing a list of all of the links to the listed malware website, including the title and the URL of the affected webpages on your website and the malware URL used in the outbound link. This detailed information simplifies your effort in searching for and removing the malware links on your website.

* Filter your results to target specific parts of your website

If you have more than 1,000 occurrences of a specific issue, you can apply a filter to get a report of affected webpages filtered by specific subdomains or subfolders. Simply click Filter selected issues by subdomain or subfolder and type the subdomain or subfolder address you want to use. Be sure to select either Include only results with this URL or Exclude all results with this URL to apply the intended filter.

The Outbound Links tool currently supports filtering up to two levels of subdomain or subfolder:
    * sub1.domain.com
    * sub2.sub1.domain.com
    * domain.com/folder1
    * domain.com/folder1/folder2
    * sub1.domain.com/folder1
    * sub2.sub1.domain.com/folder1/folder2

* Find all other websites to which users browse from your website

By default, the results table shows all URLs to which users browse from webpages within your website (including URLs within your website). To see only links to third party websites:
1. Click Filter outbound links by top-level domain, subdomain, or subfolder.
2. Type your website’s domain name in the Filter outbound links by top-level domain, subdomain or subfolder box.
3. Select Exclude all results with this URL from the drop-down list.
4. Click Search.

For example, if your Webmaster Center account was msn.com, you would type msn.com in the Filter outbound links by top-level domain, subdomain or subfolder box. If the account you had created was autos.msn.com, you would type autos.msn.com in the Filter outbound links by top-level domain, subdomain or subfolder box.

* View webpages within a specific website to which users navigate from your website

You can see all webpages within another website to which users navigate from your website. For example, you can see all the webpages within nytimes.com or latimes.com that you link to. To see the webpages your website links to:
1. Click Filter outbound links by top-level domain, subdomain, or subfolder.
2. Type the name of the domain (for example, nytimes.com or digg.com) to which you want to see links in the Filter outbound links by top-level domain, subdomain or subfolder box.
3. Select Include only results with this URL from the drop-down list.
4. Click Search.

* Find linking patterns to top-level domains (TLDs)

The presence of links to top-level domains in outbound links might imply certain attributes about navigation patterns coming from your website. For example, if you see many links to ".fr" or ".es", this could indicate that those websites are useful to your audience. Other examples include links to the ".edu" or ".org" domains. Links to these could indicate that there are academic publications or non-profit organizations that are interesting to your users.

To filter your results to include only links to a specific TLD:
1. Click Filter outbound links by top-level domain, subdomain, or subfolder.
2. Type the name of the TLD to which you want to see links in the Filter outbound links by top-level domain, subdomain or subfolder box.
3. Select Include only results with this URL from the drop-down list.
4. Click Search.

* Work with results offline

The results pane only allows you to see the first 20 results. To see a detailed report containing all of the webpages affected by the selected issue type on your website, click Download all results. You'll download a comma separated values (CSV) file containing a list of all of the affected pages on your website, including the title and the URL of the affected webpages on your website and the date of the last scan that detected the issue. This can help you analyze your results, because you can take advantage of Excel's built-in string functions, as well as the sorting and filtering capabilities.

If you have more than 1,000 results in your issue search, the download feature is renamed Download up to 1000 results. To see more than 1,000 issue search results, use the filter functionality in combination with downloading results. This enables you to create customized reports for different sections of your website. For example, if you were the webmaster for Microsoft.com, you could pull a report of long-dynamic URL issues for just MSDN (msdn.microsoft.com), just TechNet (technet.microsoft.com), or even the Japanese MSDN subsidiary (microsoft.co.jp/msdn).
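If you would rather use a script than Excel, the include/exclude filtering described above can be roughly reproduced offline in Python. This is only a sketch: the column names (Title, Page URL, Outbound URL) and the sample rows are invented, since the real CSV layout is not shown here, and matches and filter_rows are hypothetical helper names.

```python
import csv
import io
from urllib.parse import urlsplit


def matches(url, pattern):
    """True if url is on the pattern's host (or a subdomain of it) and under its folder.

    pattern can be a TLD (".fr"), a domain ("domain.com"),
    a subdomain, or a subfolder ("sub1.domain.com/folder1").
    """
    target = urlsplit(url if "//" in url else "//" + url)
    wanted = urlsplit("//" + pattern.lstrip("."))
    host = (target.hostname or "").lower()
    wanted_host = (wanted.hostname or "").lower()
    host_ok = host == wanted_host or host.endswith("." + wanted_host)
    folder = wanted.path.rstrip("/")
    path_ok = not folder or target.path == folder or target.path.startswith(folder + "/")
    return host_ok and path_ok


def filter_rows(rows, column, pattern, include=True):
    """include=True keeps only matching rows; include=False drops them."""
    return [row for row in rows if matches(row[column], pattern) == include]


# Invented sample standing in for a downloaded report.
SAMPLE = """Title,Page URL,Outbound URL
Home,http://domain.com/,http://www.nytimes.com/news
Blog,http://sub1.domain.com/folder1/post,http://example.fr/page
About,http://domain.com/about,http://domain.com/contact
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

external = filter_rows(rows, "Outbound URL", "domain.com", include=False)
french = filter_rows(rows, "Outbound URL", ".fr")
print([row["Title"] for row in external])
print([row["Title"] for row in french])
```

With a real download you would open the file with csv.DictReader and use whatever column headers it actually contains.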

Frequently asked questions about outbound links

* Do links to unwanted websites adversely affect my Bing ranking?

Possibly. Bing looks at many factors when rating your website, and one of them is the list of webpages that you link to. Bing treats linking as a type of endorsement, so if you are linking to known web spam websites, that could bring down the rank of your website.

In the event that you need to reference a website that you think might be bad, it is best to use the rel="nofollow" attribute inside the anchor tag. This way Bing doesn't take that link into account when it ranks your website. Another option is to simply include the link as text and not place it in an anchor tag. Both of these options are also recommended for all links you might have from user-generated content (such as blog comments). Taking these steps helps prevent people from spamming your website.

* How accurate is this data?

The accuracy of the outbound link data can vary. Therefore, use the data as an indicator of overall outbound link trends and view this data in the context of your own web server logs.

Search engines don't store all backlink data in one location. Instead, they use a tree-like structure called distributed storage. This storage model is efficient for general backlink information, but can make getting an exact backlink count at any given time difficult. Webmaster Center strives to make the numbers as meaningful as possible. However, they are estimates and shouldn't be taken literally.

Source

Code

http://search.msn.com/msnbot.htm

Code

http://www.google.com/support/webmasters/

Remember that uCoz already provides the robots.txt file; the info above is just to explain how robots come to your site.

PiratesHaven

Post edited by shadychiri - Saturday, 2011-01-01, 7:02 PM

Message # 3 | 12:24 PM 2011-05-23

Here is some further info
Site Snapshot

Quote

These days search engines like Google can capture snapshots of your site. Here is more:

Instant Previews are page snapshots that are activated by clicking on the magnifying glass icon; they allow users to get a glimpse of the layout of the web pages behind each search result, in order to help them decide whether or not to click a link.

Instant Previews are extremely useful to users and can help them decide whether or not to click on your site in the search results. You can, however, specify that Google should not display Instant Preview for your page, in which case neither the text snippet nor the preview will appear.

Instant Preview is available for most types of pages in Web search results. If Instant Preview isn't available for your site, it's probably for one of the following reasons:

* The page contains a meta nosnippet tag that prevents Google from displaying snippets and previews in search results.
* Your page is blocked by robots.txt.

In addition, embedded video or other rich media may not be visible in Instant Preview. The preview generator renders JavaScript, AJAX, CSS3, frames, and iframes in the same way that a Safari / Webkit-based browser would. It currently does not support Flash, Silverlight, or Java applets. Flash-based content may be shown as a "puzzle piece" in the preview image. Font-replacement techniques that use Flash are currently not supported.

In general, Google updates the Instant Preview snapshot as part of our web crawling process. When we don’t have a cached preview image (which primarily happens when we can’t fetch the contents of important resources), we may choose to create a preview image on-the-fly based on a user’s request. To generate previews on the fly, Google uses the user-agent Google Web Preview (Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13) to render images on demand, so you may see this in your referrer logs. If you've recently updated your page and the changes aren't reflected in Instant Preview, wait a day or so and then check again.
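To spot these requests in your own logs, you can look for the user-agent tokens mentioned in this thread. The snippet below is a simple sketch that matches on substrings only; the labels and the name classify_user_agent are made up, and a user-agent string can be faked, so treat the result as a hint rather than proof.

```python
# Tokens taken from the user-agent names mentioned in this thread.
KNOWN_AGENTS = (
    ("Google Web Preview", "Instant Preview renderer"),
    ("Feedfetcher-Google", "Feedfetcher"),
    ("Googlebot", "Googlebot crawler"),
)


def classify_user_agent(user_agent):
    """Return a label for known Google user-agents, or "other"."""
    lowered = user_agent.lower()
    for token, label in KNOWN_AGENTS:
        if token.lower() in lowered:
            return label
    return "other"


preview_ua = ("Mozilla/5.0 (en-us) AppleWebKit/525.13 "
              "(KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13")
print(classify_user_agent(preview_ua))
print(classify_user_agent("Mozilla/5.0 (Windows NT 6.1) Firefox/4.0"))
```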

Recommended reading on search engine optimization

Code

http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf

Added (2011-05-23, 6:24 Am)
---------------------------------------------
Here is another useful tool for understanding your website:

Code

http://www.websitereckon.com/tools/
