December 31, 2005

Getting New Content Pages Indexed Quickly

Introduction

Ever wonder how you can get new content pages indexed quickly?

In most cases you create the new page, publish it to your Web site, add some internal links to the page so your users and the search engine spiders can find it, and wait. It can take anywhere from a day or two to several weeks before the page is indexed. If your Web site is spidered frequently, you have a much better chance of getting the page indexed in a few days. The rest of us end up waiting.

Solution

Analyze your Web site access log files and find all of the page requests that resulted in 404 errors. Once you have this list, search the User-Agent field and find all the log entries that were generated by search engine spiders. These are typically requests for pages that used to exist on your Web site but have since been removed or moved.

The search engine spiders don't know the pages are no longer at that location, so they keep coming back to reindex them. Here's the kicker: replace those missing pages with your new content! The new pages you create at those addresses will get indexed quickly and show up in the SERPs much sooner than if you had just posted new content at a new address.

Skill Level

Beginner |•••••|•••••| Advanced

More Information

There are three basic ways to analyze your access log file: analytics software, analytics services, and the 'grep' utility that comes loaded on any Unix-based operating system. For Windows users I recommend sticking with the analytics software or service if possible, because Windows doesn't come with a 'grep' utility.

Log File Analytics Software

Using analytics software is by far the best choice if you can afford it or have access to an analytics package capable of providing the information needed from the access log file. One of the programs we sell, Robot-Manager Professional, is capable of giving you this information quickly. There are other programs available too, and you may already have one that can give you this information. If you do, save your money and use what you already have.

Log File Analytics Service

If your Web hosting provider has an analytics package installed on the server, you may be able to use it to generate a report showing all of the page requests that generated a 404 error code. They may even allow you to filter that report down to just search engine spider requests. If not, you'll need to manually look for the search engine spider requests.

If you're using a site statistics service that has you insert JavaScript code or an invisible image on your page, it won't work. The reason is twofold. First, the page that is being requested is missing, so the JavaScript code will never load. Second, the search engine spiders you're interested in don't normally request images from the pages they index.

The GREP Utility

If your Web site is hosted on a Unix-based server, you should be able to Telnet or SSH into your Web site and run grep on your access log file. You may need to contact your ISP to find out if this is possible and which Telnet clients they support.

The default name of your access log file on an Apache server is usually access_log. You'll notice that there is no extension on the file name. Unix-based servers don't use extensions to tell whether a file is a program or not. Again, you may need to contact your ISP to find out what the name of your access log file is.

Assuming your access log file is named access_log, the command below will count the number of 404 entries in your log file and display that number. I always try to count the entries first so I have an idea of what I'm about to face.

    grep -c ' 404 ' access_log

If your log file is compressed, like mine, you'll need to run a command like this instead.

    gunzip -c access.log.gz | grep -c ' 404 '

Please notice the space characters before and after the text 404 in the command line. Including them is important for finding the correct entries, because without them grep would also match any line that merely contains the digits 404, such as a page name or a byte count. Once we have a count, we can generate a file that has all (or almost all) of the search engine spider requests. We'll add a few more parameters to the command line to help narrow this file down.

    grep ' 404 ' access_log | grep -iv 'mozilla' | grep -iv 'favicon.ico' > 404.txt
    gunzip -c access.log.gz | grep ' 404 ' | grep -iv 'mozilla' | grep -iv 'favicon.ico' > 404.txt

Use the first command for a plain log file and the second for a compressed one. The multiple grep commands on the same line eliminate some of the more common entries generated by Web browsers. The '-iv' option tells grep to ignore case and to select only the entries that do not match our text.
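Two things can slip through this filter. The ' 404 ' pattern also matches a successful request whose response happens to be 404 bytes long, and dropping every entry that contains 'mozilla' also drops spiders whose User-Agent starts with Mozilla/5.0 (compatible; ...). The sketch below is one possible variation, not a replacement for the commands above. It assumes Apache's common or combined log format, where the status code is the ninth whitespace-separated field, and the list of spider names is only my guess at what you may want to look for. The log entries are made up for illustration.

```shell
# Made-up sample entries in Apache combined log format, for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [31/Dec/2005:10:00:00 -0600] "GET /old-page.htm HTTP/1.0" 404 208 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
192.0.2.2 - - [31/Dec/2005:10:01:00 -0600] "GET /moved.htm HTTP/1.0" 404 208 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
192.0.2.3 - - [31/Dec/2005:10:02:00 -0600] "GET /tiny.htm HTTP/1.0" 200 404 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
192.0.2.4 - - [31/Dec/2005:10:03:00 -0600] "GET /favicon.ico HTTP/1.1" 404 208 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
EOF

# The filter from the article, reduced to the requested paths (field 7).
original=$(grep ' 404 ' "$log" | grep -iv 'mozilla' | grep -iv 'favicon.ico' | awk '{ print $7 }')

# Variation: require status 404 in field 9, then keep only named spiders.
# The names below are an assumption; adjust them to the spiders you care about.
spiders='googlebot|msnbot|slurp'
variation=$(awk '$9 == 404' "$log" | grep -iE "$spiders" | awk '{ print $7 }')

echo "original filter:"
echo "$original"
echo "variation:"
echo "$variation"
rm -f "$log"
```

On this sample the original filter keeps /tiny.htm, which was actually served, and misses /moved.htm, which a spider asked for and didn't get. Requests with spaces in the URL will shift the field positions, so treat the field numbers as a starting point.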

The 404.txt file should now have a list of pages requested by the search engine spiders that are no longer available on your Web site. You can display this file using cat 404.txt from the command line. Now go through this file and find pages that you can use to publish new content to.
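One way to go through the file is to boil it down to a list of requested pages, most requested first. The sketch below assumes your log uses Apache's common or combined format, where the requested path is the seventh whitespace-separated field. The sample entries stand in for 404.txt and are made up for illustration.

```shell
# Made-up stand-in for 404.txt so the example runs on its own.
hits=$(mktemp)
cat > "$hits" <<'EOF'
192.0.2.1 - - [31/Dec/2005:10:00:00 -0600] "GET /old-page.htm HTTP/1.0" 404 208 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
192.0.2.5 - - [31/Dec/2005:11:00:00 -0600] "GET /default.htm HTTP/1.0" 404 208 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
192.0.2.6 - - [31/Dec/2005:12:00:00 -0600] "GET /old-page.htm HTTP/1.0" 404 208 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
EOF

# Pull out the path (field 7), count duplicates, and sort by count, highest first.
# The final awk only trims the padding that uniq -c adds.
summary=$(awk '{ print $7 }' "$hits" | sort | uniq -c | sort -rn | awk '{ print $1, $2 }')
echo "$summary"
rm -f "$hits"
```

On your own server, point the first awk command at 404.txt instead of the sample file. The pages at the top of the list are the ones the spiders come back for most often, so they are probably the best candidates for new content.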

Summary

This spider trick can be used by SEO and SEM companies and Web site owners that are looking to add new content to a Web site quickly. We skipped over some parts of the process in this article to keep it brief. If you have basic knowledge of the workings of your Web server, this task should be easy to do and probably take less than an hour to complete.

Exploring Further

When you start exploring the 404.txt file you created, you'll soon discover requests for pages that never existed on your site, or even improperly formatted requests. Most requests that look strange or are for pages that never existed, in directories that never existed, are hack attempts on your Web site. You can probably safely ignore these.

The other requests, though, are a little goldmine of potential tweaks to your Web site. For instance, for whatever reason, msnbot has been requesting the page /default.htm from my Web site. Maybe it found this page through an incorrect link on another Web site. In any case, I can use this information and apply a 301 redirect for that page through my Web server's configuration file. I'm using Apache, so I added the following line to httpd.conf. You can also add it to your .htaccess file.

    redirect 301 /default.htm http://www.websitemanagementtools.com/

The next time msnbot comes to my Web site and requests the page in error, it will be redirected to the correct page and told at the same time that this change is permanent (i.e., update your index to point to the new page). You can apply this trick to virtually any type of request, even requests that are formatted incorrectly.
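If you end up with more than a couple of these, you don't have to type each line by hand. Here is a minimal sketch that prints one Redirect line per path for you to review and paste into httpd.conf or .htaccess. The /old-page.htm path is hypothetical, and sending everything to the home page is just an assumption for the example; in practice each path may deserve its own target.

```shell
# Where the stray requests should end up (the home page, as in the example above).
target='http://www.websitemanagementtools.com/'

# Paths taken from 404.txt that you want to redirect rather than reuse.
# /old-page.htm is a hypothetical entry for illustration.
paths=$(mktemp)
printf '%s\n' /default.htm /old-page.htm > "$paths"

# Print one Redirect directive per path. Nothing is written to the server
# configuration; the output is only meant to be reviewed and copied.
rules=$(while IFS= read -r path; do
  printf 'Redirect 301 %s %s\n' "$path" "$target"
done < "$paths")

echo "$rules"
rm -f "$paths"
```

Leave out any page that your server uses as the default document for the directory you are redirecting to, or the redirect may send visitors in a loop.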

December 24, 2005

Welcome to Web Wisdom the Podcast

This weblog is dedicated to feedback and discussion on the Web Wisdom podcast.

The host of the show, Michael Lange, is a longtime veteran of the Internet (since 1995) and has been designing, promoting, and marketing Web sites since 1998. He is the author of many successful software programs including TopDog, TopDog Pro, Robot-Manager, Ranking-Manager, Website-Manager, and many others. Thousands of Web masters, SEOs, and the like use his software to optimize and manage their Web sites on the Internet.