Webinator version 5 dramatically improves refresh speed while reducing
the load on web servers. Here we compare the old and new methods so you can
see how the new refresh works.
The old way
In Webinator version 4 and earlier, the refresh walk checked
every page in the database to see if it needed updating.
Since only changed pages need updating, and those are typically a
small percentage of the site, this is faster than doing a complete
new walk. However, it is still time-consuming because the
web server must be accessed for every page on the site --
only the web server can tell Webinator if the page has changed or not.
The new way
With Webinator 5's improved refresh, the walk adapts to focus on
that small but important group of changing pages.
As each page is walked, Webinator will calculate a refresh period for
that individual page. This is based on whether it has changed since the last
time it was fetched, and how long ago that fetch was. This refresh period
is used to determine when the page should be checked again.
In this way, the crawl prioritizes walking of often-changing and new pages,
and delays the fetch of relatively static pages.
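
The exact formula Webinator uses is not given here. As a rough illustration only, the sketch below assumes a simple heuristic: halve the interval when the page changed, double it when it did not. The function name and the factors of 2 are assumptions, not Webinator's implementation.

```python
from datetime import timedelta

def next_refresh_period(changed: bool, since_last_fetch: timedelta) -> timedelta:
    """Hypothetical heuristic; Webinator's real calculation may differ."""
    if changed:
        # The page changed within the interval, so check it again sooner.
        return since_last_fetch / 2
    # The page was stable for the whole interval, so wait longer next time.
    return since_last_fetch * 2
```

Under this assumption, a page found changed 4 hours after its last fetch would be rechecked in 2 hours, while an unchanged one would wait 8 hours.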
Thus, when a crawl takes place (scheduled or manual), only the pages that
need to be refreshed now are actually fetched -- not the entire database.
The result is a site search that is more up-to-date while consuming
fewer server resources.
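
Selecting only the pages that are due can be pictured as a filter over the stored records. The sketch below is an illustration under assumptions: the PageRecord fields and the pages_due function are hypothetical and do not reflect Webinator's actual database schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PageRecord:
    """Hypothetical per-page record, not Webinator's schema."""
    url: str
    last_fetched: datetime
    refresh_period: timedelta

    @property
    def next_refresh(self) -> datetime:
        # The page becomes due once its refresh period has elapsed.
        return self.last_fetched + self.refresh_period

def pages_due(pages, now):
    """Return only the pages whose refresh time has arrived, most overdue first."""
    due = [p for p in pages if p.next_refresh <= now]
    return sorted(due, key=lambda p: p.next_refresh)
```

Pages whose next refresh time is still in the future are skipped entirely, so the web server is never contacted for them during this crawl.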
Configuration
On the All Walk Settings page you can configure several aspects of the
refresh period:
- Default Refresh Time: used the first time a page is fetched
- Minimum Refresh Time: used to stop pages being refreshed too often
- Maximum Refresh Time: to ensure that even relatively static pages
are refreshed occasionally.
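
One way to picture how these three settings interact is as a default plus a clamp. The sketch below is an assumption for illustration: the constant names and values are invented, and the real defaults are not stated in this document.

```python
from datetime import timedelta
from typing import Optional

# Illustrative values only; not Webinator's actual defaults.
DEFAULT_REFRESH = timedelta(days=1)
MIN_REFRESH = timedelta(hours=1)
MAX_REFRESH = timedelta(days=30)

def effective_refresh_period(calculated: Optional[timedelta]) -> timedelta:
    """Apply the default, minimum, and maximum settings to a calculated period."""
    if calculated is None:
        # First fetch: there is no history yet, so use the default.
        return DEFAULT_REFRESH
    # Keep the period inside the [minimum, maximum] range.
    return max(MIN_REFRESH, min(calculated, MAX_REFRESH))
```

With these example values, a calculated period of 5 minutes is raised to 1 hour, and a calculated period of 90 days is lowered to 30 days.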
Changes to the Walk Settings, such as adding extensions, take effect on
the next refresh.
Suggested Procedure
- Use a Refresh walk all the time. If no walk exists, it will start
a new walk; otherwise it will refresh the existing walk.
- Schedule Refresh walks frequently, e.g. every minute. If there is
nothing to do, or a refresh is already running, it will quit; otherwise
it will process any pages that are due.
- A Refresh walk will also resume any walks that were paused due to
user action, or because the process grew too large.
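
The behaviour described in this list can be summarised as a small decision function. This is a sketch under assumptions: the function, its parameters, and the order in which the conditions are checked are hypothetical, since the document does not state how Webinator orders these checks internally.

```python
def refresh_action(walk_exists: bool, refresh_running: bool,
                   walk_paused: bool, pages_due: int) -> str:
    """Decide what a scheduled Refresh walk does (assumed check order)."""
    if refresh_running:
        # Another refresh is already working; do not start a second one.
        return "quit"
    if not walk_exists:
        return "start new walk"
    if walk_paused:
        # Paused by user action or because the process grew too large.
        return "resume paused walk"
    if pages_due == 0:
        # Nothing to do right now.
        return "quit"
    return "refresh due pages"
```

Because the idle and already-running cases simply quit, scheduling the Refresh walk as often as every minute costs very little.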
Walk Status Page
The walk status page shows the URLs most recently fetched and the URLs
scheduled to be refreshed next, along with how many pages were walked in
the past hour and how many are scheduled for the next hour.
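
The four figures on the status page could be derived from fetch and schedule timestamps roughly as follows. This is an illustrative sketch only: the (url, time) tuple layout, the walk_status function, and the choice to exclude already-overdue pages from the next-hour count are all assumptions.

```python
from datetime import datetime, timedelta

def walk_status(fetched, scheduled, now, show=5):
    """Summarise a walk from lists of (url, datetime) tuples (hypothetical layout)."""
    hour = timedelta(hours=1)
    recent = sorted(fetched, key=lambda r: r[1], reverse=True)  # newest first
    upcoming = sorted(scheduled, key=lambda r: r[1])            # soonest first
    return {
        "recently_fetched": [url for url, _ in recent[:show]],
        "next_scheduled": [url for url, _ in upcoming[:show]],
        "walked_past_hour": sum(1 for _, t in fetched if now - hour <= t <= now),
        # Assumption: pages that are already overdue are not counted here.
        "scheduled_next_hour": sum(1 for _, t in scheduled if now < t <= now + hour),
    }
```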