Geruva Publications - Software Dept.

cs2011 A collection of Robots for Web Operations, and related things - or - The Web Scraper's Companion

Copyright 2004 Edition Arnold Kochman. Other copyrights apply, including but not limited to the GNU General Public License. This CD is for those interested in automating searching, maintenance, and validation tasks on the internet. There is more here for Linux and Unix users than for users of other systems, but Windows users can also find value here. Most of the packages are distributed as source, so programs in C, for example, will have to be compiled. You will need one of the commonly available unzip utilities, such as PKUNZIP or WinZip.

Some of these programs are clearly for programmers, though some can be used by ordinary people. The various packages are at different levels of maturity and completeness, and I cannot certify that they are all worthwhile for any particular purpose. You will have to judge for yourself, but there is a lot to choose from.

Here are the packages that I have included:

Harvest - A web indexing package originally designed for distributed indexing. It can form a powerful system for indexing both large and small web sites. It includes Harvest-NG, a highly efficient, modular, Perl-based web crawler.

Combine - An open system for harvesting and threshing (indexing) Internet resources. Combine was developed as a part of the Development of a European Service for Information on Research and Education (DESIRE) project, which was funded by the European Commission within Telematics for Science Program.

Libwww - Perl (5) modules for accessing Web pages, including some examples of following links.

MOMspider - A web-roaming robot that specializes in the maintenance of distributed hypertext infostructures (i.e. wide-area webs). The program is written in Perl and, once customized for your site, should work on any UNIX-based system with Perl 4.036. Copyright © 1994 Regents of the University of California, distributed with permission.

Acme.Spider - A web-robot class for Java. This is an Enumeration class that traverses the web starting at a given URL. It fetches HTML files and parses them for new URLs to look at. All files it encounters, HTML or otherwise, are returned by the nextElement() method as a URLConnection. The traversal is breadth-first, and by default it is limited to files at or below the starting point: same protocol, hostname, and initial directory.
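
The breadth-first, scope-limited traversal described above can be illustrated with a short Python sketch. This is not Acme.Spider's code or API. The function names `in_scope` and `crawl` are hypothetical, and a plain dictionary of links stands in for fetching and parsing real pages.

```python
from collections import deque
from urllib.parse import urlsplit


def in_scope(start, url):
    """True if url has start's protocol and hostname and lies at or below start's directory."""
    s, u = urlsplit(start), urlsplit(url)
    # Directory part of the starting URL, always ending in "/".
    base_dir = s.path.rsplit("/", 1)[0] + "/"
    same_site = (s.scheme, s.netloc) == (u.scheme, u.netloc)
    return same_site and (u.path or "/").startswith(base_dir)


def crawl(start, links):
    """Yield URLs breadth-first from start, following only in-scope links.

    links maps a URL to the list of URLs found on that page; it is an
    assumed stand-in for downloading the page and parsing its HTML.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        yield url
        for nxt in links.get(url, []):
            if nxt not in seen and in_scope(start, nxt):
                seen.add(nxt)
                queue.append(nxt)


if __name__ == "__main__":
    site = {
        "http://example.com/docs/index.html": [
            "http://example.com/docs/a.html",
            "http://example.com/other.html",        # above the start directory
            "http://example.com/docs/sub/b.html",
        ],
        "http://example.com/docs/a.html": [
            "http://example.com/docs/sub/c.html",
            "https://example.com/docs/x.html",      # different protocol
        ],
    }
    for page in crawl("http://example.com/docs/index.html", site):
        print(page)
```

The queue gives the breadth-first order, and the `seen` set keeps a page that is linked from several places from being visited twice.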

WebSPHINX - A Java class library and interactive development environment for web crawlers. Copyright © 1998-2002 - Carnegie Mellon University, distributed with permission. This product includes software developed by the Apache Software Foundation (http://www.apache.org/). WebSPHINX includes the Apache jakarta-regexp regular expression library, version 1.2. The (unmodified) source code for this library is included in the WebSPHINX source code. Redistribution is allowed under the terms of the Apache Public License.

Webbot - A very fast Web walker with support for regular expressions, SQL logging facilities, and many other features. The webbot comes with the libwww codebase. It can be used to check links, find bad HTML, map out a web site, download images, etc. Source code is in C, based on "Libwww." Webbot was designed to test HTTP/1.1 pipelining, but is useful for other purposes.

ht://Dig - The ht://Dig system is a complete indexing and searching system for a domain or intranet.

SWISH-E - A fast, powerful, flexible, free, and easy to use system for indexing collections of Web pages or other files. It is highly configurable. SWISH-E is distributed with no warranty under the terms of the GNU General Public License.

Pavuk - A UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and, optionally, HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK+ widget set.

Win32::Internet - A package that provides access to Windows internet services from Perl programs.

ASPSeek - An Internet search engine. It consists of an indexing robot, a search daemon, and a CGI search frontend. It supports Webspaces, which means that the user can combine and perform searches within several Web sites simultaneously, instead of browsing each site individually. Search results can be limited by time, site or Web space, and sorted by relevance (PageRanks are used) or date.

ASPSeek is optimized for multiple sites (threaded index, async DNS lookups, grouping results by site), but can be used for searching one site as well. Other features include stopwords and ispell support, a charset and language guesser, HTML templates for search results, excerpts, and query words highlighting.

Pagecast - Submits lists of URLs to search engines. It also has more advanced features, such as the ability to check the URLs for problematic conditions. It is designed to be simple to use and effective at what it does. Pagecast runs either from the command line or as a mail-robot.

Wget - A network utility to retrieve files from the Web using HTTP and FTP, the two most widely used Internet protocols. It works non-interactively, so it can keep working in the background after you have logged off. The program supports recursive retrieval of web-authoring pages as well as FTP sites: you can use wget to make mirrors of archives and home pages or to travel the Web like a WWW robot.

Larbin - An HTTP Web crawler. Larbin uses standard libraries, plus adns. The program is multithreaded but prefers using select instead of a lot of threads (for efficiency purposes). The advantage of Larbin over wget or ht://dig is that it is much faster and very easy to customize.

Webbase - An Internet crawler. It crawls the Web to get documents, stores them locally, and builds a full text MySQL database with them. It can also visit sites regularly to make sure each document is still there and update its copy if the document changes. The database uses the local copies of documents to build a searchable index.

Flinch - Flinch is a Web link checker. It can periodically check all the external links on your Web pages and produce HTML reports of its findings. If a Web resource at the end of a link has not been reachable for a few days, Flinch can send you an email.
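
The "unreachable for a few days" rule can be sketched in Python as a small failure log. This is only an illustration of the idea, not Flinch's implementation: the function names, the in-memory dictionary, and the three-day grace period are all assumptions.

```python
from datetime import datetime, timedelta


def record_check(first_failure, url, ok, now):
    """Update the failure log after one check of url.

    first_failure maps a URL to the time of the first failed check in its
    current run of failures. A success clears the entry; a failure starts
    the clock only if it is not already running.
    """
    if ok:
        first_failure.pop(url, None)
    else:
        first_failure.setdefault(url, now)


def links_to_report(first_failure, now, grace=timedelta(days=3)):
    """Return the URLs that have been failing for at least `grace` (assumed: 3 days)."""
    return sorted(url for url, since in first_failure.items() if now - since >= grace)


if __name__ == "__main__":
    log = {}
    day = timedelta(days=1)
    t0 = datetime(2004, 1, 1)
    record_check(log, "http://a.example/", False, t0)
    record_check(log, "http://a.example/", False, t0 + 2 * day)
    print(links_to_report(log, t0 + 4 * day))
```

Tracking only the start of each run of failures means a single successful check resets the clock, so a link that is briefly down does not trigger an email.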

OpenFTS - OpenFTS is an advanced PostgreSQL-based search engine that provides online indexing of data and relevance ranking for database searching. Close integration with the database allows the use of metadata to restrict search results.

Alist - Alist is a program that collects hardware and software information about systems and stores it in a database for users to browse and search via a Web interface. The program consists of three parts: a client portion that collects the information, a daemon that receives data sent from clients, and a CGI that displays and lets you search for information.

HTTrack - HTTrack is an offline browser utility. It lets you download a Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Just by opening a page of the mirrored Web site in your browser you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads.

LinkCheck - LinkCheck automates checking a Web site for broken links. It creates a report of broken links, download times, and every link and FTP connection on the site. It is handy for Web designers who need to check over their site, but can also be installed as a CGI program so Web designers can check via a Web page. It flags pages that are particularly slow to download. It reports server types and the date a file was last modified, and validates and reports temporarily moved pages while checking the new location.

Namazu - A full-text search engine intended for easy use. Not only does it work as a small or medium scale Web search engine, but also as a personal search system for email or other files.

SWISH++ - A Unix-based file indexing and searching engine, typically used to index and search files on web sites. It was based on SWISH-E, although SWISH++ is a complete rewrite. SWISH++ was developed to circumvent its author's difficulties with using the SWISH-E package. SWISH++ has been ported to compile and run under Microsoft Windows by Robert J. Lebowitz and Christoph Conrad.

WebWader - WebWader is a tool to automatically and thoroughly browse a Web site. First it scans pages stored in the site, building a tree of HTML pages. This tree is then traversed automatically and each page is displayed by a browser. WebWader is especially useful for webmasters who want to redesign a site with a complex hierarchy of files. The program travels through a tree of HTML pages and displays each one in a window, allowing for quick changes in the page's appearance.

DataparkSearch - DataparkSearch is an Internet and intranet search engine tool. It supports the http, https, ftp, nntp and news URL schemes, plus the htdb virtual URL scheme for indexing SQL databases, and handles the text/html, text/xml, text/plain, audio/mpeg and image/gif MIME types. External parsers can support other document types. It also supports multilingual content, spelling variation and stemming, and synonym lists.

Wgrab - Wgrab is a Perl script for selectively downloading parts of a Web site from the command line and storing them in the local filesystem. You can use date-based and/or number-based patterns to specify which documents to download, and regular expressions to restrict which references are to be followed when performing recursive retrieval.

Sirobot - Sirobot is a Perl script that downloads Web pages recursively. Its main advantage over wget is its ability to fetch pages concurrently. It can also continue aborted downloads and convert absolute links to relative ones. It uses curses, can do HTTPS, and has a pattern-matching filter to prevent you from downloading the whole Internet.
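
Converting absolute links to relative ones is what lets a downloaded copy be browsed from local disk. A minimal Python sketch of that conversion follows. It is not taken from Sirobot, and the helper name `to_relative` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def to_relative(page_url, link_url):
    """Rewrite link_url relative to page_url when both are on the same site."""
    page, link = urlsplit(page_url), urlsplit(link_url)
    if (page.scheme, page.netloc) != (link.scheme, link.netloc):
        return link_url  # different site: leave the link absolute

    # Compute the path from the directory holding the page to the link target.
    page_dir = posixpath.dirname(page.path) or "/"
    rel = posixpath.relpath(link.path or "/", start=page_dir)
    if link.path.endswith("/"):
        rel += "/"  # relpath drops a trailing slash; restore it

    # Carry over any query string and fragment unchanged.
    if link.query:
        rel += "?" + link.query
    if link.fragment:
        rel += "#" + link.fragment
    return rel


if __name__ == "__main__":
    print(to_relative("http://example.com/docs/index.html",
                      "http://example.com/docs/img/logo.png"))
    print(to_relative("http://example.com/docs/sub/page.html",
                      "http://example.com/index.html#top"))
```

Links that point to another host are returned unchanged, since they cannot be expressed relative to the local copy.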

phpLinkValidator - phpLinkValidator validates links in a specified HTML document and, optionally, all related files. It searches the specified file or page for all links, then runs itself recursively on each linked file.
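
The search-then-recurse approach can be sketched in Python with the standard library's HTML parser. This is an illustration only, not phpLinkValidator's code. The names `LinkCollector`, `extract_links` and `find_broken` are hypothetical, and a dictionary of page sources stands in for real HTTP requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of <a href> and <img src> as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in (("a", "href"), ("img", "src")):
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def find_broken(url, pages, seen=None):
    """Recursively visit url and what it links to; return the URLs missing from pages.

    pages maps a URL to its HTML source (an assumed stand-in for fetching).
    """
    seen = set() if seen is None else seen
    if url in seen:
        return []
    seen.add(url)
    if url not in pages:
        return [url]
    broken = []
    for link in extract_links(pages[url], url):
        broken += find_broken(link, pages, seen)
    return broken


if __name__ == "__main__":
    site = {
        "http://example.com/index.html": '<a href="a.html">A</a> <a href="missing.html">M</a>',
        "http://example.com/a.html": '<a href="index.html">back</a><img src="/gone.png">',
    }
    print(find_broken("http://example.com/index.html", site))
```

The `seen` set is what stops the recursion when two pages link to each other.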