Gonzalez

How to Get Search Engines to Discover (Index) All the Web Pages on Your Site

by Christopher Heng

If your site is one of those websites where only a few pages seem to be indexed by the search engines, this article is for you. It describes how you can provide the major search engines with a list of all the pages on your website, thus allowing them to learn of the existence of pages which they may have missed in the past.

How do you Find Out which Pages of your Website are Indexed?

How do you know which pages of your site have been indexed by a search engine and which have not? One way is to use "site:domain-name" to search for your site. This works with Google, Yahoo and Microsoft Live, although not with Ask.

For example, if your domain is example.com, type "site:example.com" (without the quotes) into the search field of the search engine. From the results list, you should be able to see all the pages which the search engine knows about. If you find that a page from your site is not listed, and you have not intentionally blocked it using robots.txt or a meta tag, then perhaps that search engine does not know about that page or has been unable to access it.

Steps to Getting the Search Engine to Discover and Index Your Whole Site

Here's what to do when you discover that there are pages not indexed by the search engine.

Check Whether Search Engines are Blocked from that Page

The first thing to do is to check your robots.txt file, and make sure it complies with the rules of a robots.txt file. Many webmasters, new and old, unintentionally block a search engine from a part of their site by having errors in their robots.txt file.
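If you would rather not check the file by eye, a rough test can be scripted. The following is a minimal sketch in Python using the standard library's urllib.robotparser module. The robots.txt contents, the page URLs and the function name blocked_urls are made up for illustration, and the module's matching rules may differ in small ways from those of any particular search engine.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt contents and page list, for illustration only.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

PAGES = [
    "http://www.example.com/",
    "http://www.example.com/private/notes.html",
]

if __name__ == "__main__":
    for url in blocked_urls(ROBOTS, PAGES):
        print("Blocked by robots.txt:", url)
```

Any page that turns up in the output but that you actually want indexed points to a rule in robots.txt that needs fixing.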

Another thing you might want to do is to make sure that your web page does not have a meta tag that prevents a robot from indexing a particular page. This may occur if you have ever put a meta "noindex" tag on the page, and later wanted it indexed but forgot to remove it.
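A forgotten "noindex" tag can also be hunted down with a short script. This is only a sketch, again in Python with the standard library. It assumes the tag takes the common form of a meta element named "robots", and the class and function names are my own invention. It does not look at tags aimed at one specific robot, nor at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        directives = [d.strip().lower() for d in attributes.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML text contains a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(sample))
```

Feed it the saved HTML of the page that is missing from the index. If it reports True, remove the tag from the page.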

Create a File Using the Sitemap Protocol

The major search engines, Google, Yahoo, Live and Ask, all support something known as a Sitemap file. This is not the "Site Map" that you see on many websites, including thesitewizard.com. My Site Map and others like it are primarily designed to help human beings find specific pages on the website. The sitemap file that uses the Sitemap protocol is, instead, designed for search engines, and is not at all human-friendly.

Sitemaps have to adhere to a particular format. The detailed specifications for this can be found at the sitemaps.org website. It is not necessary to use every aspect of the specification to create a site map if all you want is to make sure the search engines locate all your web pages. Details on how to create your own sitemap will be given later in this article.

Modify Your Robots.txt File for Sitemaps Auto-Discovery

As a result of the sitemap protocol, an extension to the robots.txt file has been agreed upon by the search engines. Once you have finished creating the sitemap file and uploaded it to your website, modify your robots.txt file to include the following line:

Sitemap: http://www.example.com/name-of-sitemap-file.xml

You should change the web address ("URL") given to the actual location of your sitemap file. For example, change "www.example.com" to your domain name and "name-of-sitemap-file.xml" to the name that you have given your sitemap file.

If you don't have a robots.txt file, please see my article on robots.txt for more information on how to create one. The article can be found at http://www.wjunction.com/showthread.php?t=73109

The search engines that visit your site will automatically look into your robots.txt file before spidering your site. When they read the file, they will see the sitemap file listed and load it for more information. This will enable them to discover the pages that they have missed in the past. In turn, this will hopefully send them to index those files.
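For those who maintain robots.txt with a script, the change can be sketched as follows. This is an illustrative Python example and not the only way to do it. The function name add_sitemap_line is hypothetical, and the site_maps() method it relies on needs Python 3.8 or later. It reads the file much as a robot would, and appends the Sitemap line only if that address is not already listed.

```python
from urllib.robotparser import RobotFileParser


def add_sitemap_line(robots_txt, sitemap_url):
    """Return robots.txt text that lists sitemap_url, appending a line if needed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap lines at all.
    if sitemap_url in (parser.site_maps() or []):
        return robots_txt
    if robots_txt and not robots_txt.endswith("\n"):
        robots_txt += "\n"
    return robots_txt + "Sitemap: " + sitemap_url + "\n"


if __name__ == "__main__":
    # Hypothetical existing robots.txt that allows everything.
    existing = "User-agent: *\nDisallow:\n"
    print(add_sitemap_line(existing, "http://www.example.com/name-of-sitemap-file.xml"))
```

Running the function a second time on its own output leaves the text unchanged, so it is safe to call every time the site is uploaded.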

How to Create a Sitemap File

A sitemap file that follows the Sitemap Protocol is just a straightforward ASCII text file. You can create it using any ordinary ASCII text editor. If you use Windows, Notepad (found in the Accessories folder of your Start menu) can be used. Do not use a word processor such as Microsoft Word.

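By way of example, take a look at the following sample sitemap file.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/</loc></url>
<url><loc>http://www.example.com/example.html</loc></url>
</urlset>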

You will notice that a sitemap file begins with the text

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

and ends with

</urlset>

Those portions of the sitemap file are invariant. All sitemaps have to begin and end this way, so you can simply copy them from my example to your own file.

Next, notice that every page on the website (that you want indexed in the search engine) is listed in the sitemap, using the following format:

<url><loc>http://www.example.com/</loc></url>

where http://www.example.com/ should be replaced by the URL of the page you want indexed. In other words, if you want to add a page, say, http://www.example.com/sing-praises-for-thesitewizard.com.html to your website, just put the web address for that page between <url><loc> and </loc></url>, and place the entire line inside the section demarcated by <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> and </urlset>.

To make your job simpler, just copy the entire example sitemap that I gave in the example above, replace all the example URLs with your own page addresses, add any more that you like, and you're done.
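If your site has too many pages to type out by hand, the same file can be produced by a small script. The following is a minimal sketch in Python using only the standard library. The function name build_sitemap and the list of page addresses are assumptions for illustration. It emits only the url and loc elements described above, and none of the optional parts of the specification.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return the text of a sitemap file listing every address in page_urls."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in page_urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when the text is written out.
        ET.SubElement(url, "loc").text = page_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


if __name__ == "__main__":
    # Hypothetical page list; replace with the addresses of your own pages.
    pages = [
        "http://www.example.com/",
        "http://www.example.com/sing-praises-for-thesitewizard.com.html",
    ]
    with open("sitemap.xml", "w", encoding="utf-8") as sitemap_file:
        sitemap_file.write(build_sitemap(pages))
```

The script writes everything after the first line without line breaks, which makes no difference to the search engines. Letting the library write the file also takes care of addresses containing an ampersand, which has to appear as &amp; in XML.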

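Save the file under any name you like. Most people save it with a ".xml" file extension. If you don't have any particular preference, call it "sitemap.xml". If you use Notepad instead of a decent text editor, enclose the name in double quotes (that is, type "sitemap.xml", including the quotes) in the "Save As" dialog box, otherwise Notepad may save the file as "sitemap.xml.txt".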

Remember to update your robots.txt file as mentioned earlier to include the URL of your sitemap file, so that the search engines can learn of the existence of the file.

Note: a sitemap file cannot have more than 50,000 URLs (web addresses) nor be bigger than 50 MB uncompressed. If yours is bigger than that, you'll have to create multiple sitemap files. Please see the Sitemaps site on how this can be done.
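For a very large site, the splitting can be scripted too. The sketch below, in Python, assumes the approach described on the Sitemaps site: the addresses go into several sitemap files, and a separate sitemap index file lists where each of those files lives. The function names and the file names sitemap1.xml, sitemap2.xml and so on are hypothetical. It only guards against the limit on the number of URLs and does not check file sizes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000


def chunk_urls(page_urls, size=MAX_URLS_PER_SITEMAP):
    """Split the list of addresses into lists of at most `size` entries each."""
    return [page_urls[i:i + size] for i in range(0, len(page_urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return the text of a sitemap index file pointing at each sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


if __name__ == "__main__":
    # Hypothetical site with 120,001 pages, which needs three sitemap files.
    pages = ["http://www.example.com/page%d.html" % n for n in range(120001)]
    chunks = chunk_urls(pages)
    names = ["http://www.example.com/sitemap%d.xml" % (n + 1) for n in range(len(chunks))]
    print(len(chunks), "sitemap files needed")
    print(build_sitemap_index(names))
```

Each list returned by chunk_urls goes into its own sitemap file in the usual format, and the index file is the one you point the search engines to.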

Conclusion: Dealing with Missing Pages in the Search Engine's Index

If you have pages on your website that seem to be omitted from the search engine indices, following the tips in this article will help you make sure that the search engines learn of all the pages on your web site. Of course, whether they actually go about spidering and listing them is another matter. However, with the sitemap file, you can at least know that they are aware of all the available pages on your site.
