Thunderstone Software Document Retrieval and Management

Webinator version 5 dramatically improves the refresh speed while reducing the load on web servers. Here we compare the old and new methods so you can see how it works.

The old way

In Webinator version 4 and earlier, the refresh walk checked every page in the database to see if it needed updating. Since only changed pages need updating, and those are typically a small percentage of the site, this is faster than doing a complete new walk. However, it is still time-consuming because the web server must be accessed for every page on the site -- only the web server can tell Webinator if the page has changed or not.

The new way

With Webinator 5's improved refresh, the walk adapts to focus on that small but important group of changing pages. As each page is walked, Webinator calculates a refresh period for that individual page, based on whether the page has changed since it was last fetched and how long ago that fetch was. The refresh period determines when the page should be checked again. In this way, the crawl prioritizes new and frequently changing pages and delays fetching relatively static ones.
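The idea of a per-page refresh period can be illustrated with a minimal sketch. The formula Webinator actually uses is not described here, so the halve-or-double rule below is purely an assumption chosen for illustration, and the function name `next_refresh_period` is hypothetical.

```python
from datetime import timedelta


def next_refresh_period(changed: bool, since_last_fetch: timedelta) -> timedelta:
    """Illustrative heuristic only; not Webinator's actual formula.

    changed          -- whether the page differs from the previously fetched copy
    since_last_fetch -- how long ago the previous fetch took place
    """
    if changed:
        # The page changed within this interval, so check again sooner.
        return since_last_fetch / 2
    # The page stayed the same for this long, so wait longer next time.
    return since_last_fetch * 2
```

Under this assumed rule, a page that keeps changing is checked at ever shorter intervals, while a static page drifts toward longer ones.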

Thus, when a crawl takes place (scheduled or manual), only the pages that need to be refreshed now are actually fetched -- not the entire database. The result is a site search that is more up-to-date while consuming fewer server resources.
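As a rough sketch of "only the pages that need to be refreshed now", assume each URL is stored with the time of its next scheduled check. The mapping and the function name `pages_due` are hypothetical and do not reflect Webinator's internal database layout.

```python
from datetime import datetime


def pages_due(next_check_by_url: dict[str, datetime], now: datetime) -> list[str]:
    """Return the URLs whose next scheduled check is at or before `now`,
    ordered so that the most overdue page comes first."""
    due = [url for url, when in next_check_by_url.items() if when <= now]
    return sorted(due, key=lambda url: next_check_by_url[url])
```

Pages whose next check lies in the future are skipped entirely, which is why the web server is not contacted for them.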

Configuration

On the All Walk Settings page you can configure several aspects of the refresh period:

  • Default Refresh Time: used the first time a page is fetched
  • Minimum Refresh Time: used to stop pages being refreshed too often
  • Maximum Refresh Time: used to ensure that even relatively static pages are refreshed occasionally
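One plausible reading of how these three settings interact is sketched below: the default applies when no period has been computed yet, and any computed period is clamped between the minimum and maximum. This is an assumption about the behavior, not a description of Webinator's code, and the function and parameter names are hypothetical.

```python
from datetime import timedelta
from typing import Optional


def effective_refresh_period(
    computed: Optional[timedelta],
    default: timedelta,
    minimum: timedelta,
    maximum: timedelta,
) -> timedelta:
    """Apply the Default, Minimum and Maximum Refresh Time settings."""
    if computed is None:
        # First fetch of this page: there is no history to compute from.
        return default
    # Keep the computed period within the configured bounds.
    return max(minimum, min(computed, maximum))
```

For example, with a minimum of one hour and a maximum of one week, a computed period of ten minutes would be raised to one hour and a computed period of thirty days would be lowered to one week.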

Changes to the Walk Settings, such as adding extensions, take effect on the next refresh.

Suggested Procedure

  1. Use Refresh walk all the time. If no walk exists, it will start a new walk; otherwise, it will refresh the existing walk.
  2. Schedule Refresh walks frequently, e.g. every minute. If there is nothing to do, or a refresh is already running, it will quit; otherwise, it will process any pages that need refreshing.
  3. A Refresh walk will also resume any walks that were paused due to user action, or because the process grew too large.
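The decisions a scheduled Refresh walk makes in the steps above can be summarized in a small sketch. The order of the checks, the function name `refresh_action`, and the returned labels are all assumptions made for illustration; the procedure only states the outcomes, not how Webinator arrives at them.

```python
def refresh_action(
    walk_exists: bool,
    refresh_running: bool,
    walk_paused: bool,
    pages_due: int,
) -> str:
    """Decide what one scheduled Refresh walk invocation should do."""
    if refresh_running:
        # Another refresh is already at work, so this invocation quits.
        return "quit"
    if not walk_exists:
        # No walk has been done yet, so a refresh starts a new one.
        return "start_new_walk"
    if walk_paused:
        # Paused by user action or because the process grew too large.
        return "resume_walk"
    if pages_due == 0:
        # Nothing to do right now.
        return "quit"
    return "process_due_pages"
```

Because an invocation with nothing to do quits immediately, scheduling it as often as every minute costs very little.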

Walk Status Page

The walk status page shows the URLs most recently fetched and the URLs scheduled to be refreshed next, along with how many pages were walked in the past hour and how many are scheduled for the next hour.

 
Copyright © 2008 Thunderstone Software LLC. All rights reserved.