In-reply-to: ccoprmm@prism.gatech.EDU's message of 7 Sep 91 02:02:17 GMT
Newsgroups: comp.sources.wanted,comp.archives.admin,comp.org.eff.talk,bionet.general,comp.protocols.tcp-ip
Followup-To: comp.archives.admin
Subject: Re: knowbots: searches on inet sections
References: <36002@hydra.gatech.EDU>

In article <36002@hydra.gatech.EDU> ccoprmm@prism.gatech.EDU (Michael Mealling) writes:

   Has anyone been doing anything with a species of program called a
   "knowbot"?  Essentially they scout the net (news, archives, mail
   lists, etc) for articles that would be of interest to the user.  If
   you've read David Brin's _Earth_, they are referred to as "ferrets".
   (I kind of like the name "meme-hound").

There's a pile of discussion going on all over in assorted places about these things; here are some pointers to where the traffic is and who's talking. This is by its very nature a sketchy report, since the whole premise is that there are people of every field and persuasion looking for tools and techniques to find interesting articles, and so you're bound to find traffic and experts just about anywhere.

"Knowbot" is trademarked, not a generic term; take care how you use it. There's a service which calls itself the "Knowbot Information Service (V1.0) Copyright CNRI 1990 All Rights Reserved," which you can use by telnet to sol.bucknell.edu, port 185; it's a system that looks up user email addresses by name. More information about it can be found at nri.reston.va.us:/rdroms/*, including the text of an internet draft on the subject. (I'll have to check if and when this was published as an RFC.) KIS is reviewed in the June 1991 issue of _Boardwatch_ Magazine.
There's a short account by Vincent Mazzarella of a presentation by Mr. David Ely of CNRI (the Corporation for National Research Initiatives) to the ACR Workshop on Computer Networks in Radiology Research in bionet.general (March 1991) where he refers to systems like "Grateful Med" as a Knowbot:

   One can build a KNOWBOT locally and then send this to a larger
   machine, have the program run on that larger machine, returning
   processed data to the local system.  GRATEFUL MED is a good example
   of this concept in which a search routine is built at the local PC
   and is then sent to a large database for a search.

The first thing I turned to in preparing this was WAIS, the Wide Area Information Service. A search for "knowbot" in the "matrix_news" database on quake.think.com yielded a good article that references a meeting of the Coalition for Networked Information (CNI) working group on "Developing a Framework for Network Directories and Related Services", held June 3-4 at Stanford. This meeting brought together some of the interested parties; you can find an account of it via anonymous FTP in the file quake.think.com:/pub/mids/matrix_news/cni.3 or by using the WAIS source in quake.think.com:/pub/mids/matrix_news.src. Rather than quote bits of the article, it would be most useful to fetch the whole thing and read it.

WAIS itself falls into this category generally, since there's a division of labor between the search engines running on remote machines like Connection Machine supercomputers and the local display and user interface systems running in point-and-click (Mac) environments or as part of your editor (GNU Emacs).

comp.archives is an example of a real-life system which uses some of these tools. A set of programs systematically filters through days' worth of usenet news postings looking for key words and phrases that refer to materials available for anonymous FTP from remote systems.
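The scanning step can be sketched roughly like this. To be clear, the patterns and function names here are my own illustrative guesses at the flavor of the thing, not the actual comp.archives code:

```python
import re

# Phrases that typically signal an ftp'able announcement.  These particular
# patterns are illustrative guesses, not the ones comp.archives really uses.
PATTERNS = [
    re.compile(r"anonymous\s+ftp", re.IGNORECASE),
    re.compile(r"available\s+(?:for|via|by)\s+(?:anonymous\s+)?ftp", re.IGNORECASE),
    re.compile(r"\bftp\s+(?:to|from)\s+[\w.-]+", re.IGNORECASE),
]

def ftp_phrases(article_body):
    """Return every key phrase found in one article's text."""
    hits = []
    for pattern in PATTERNS:
        hits.extend(pattern.findall(article_body))
    return hits

def scan_spool(articles):
    """Filter a batch of (message_id, body) pairs down to candidate
    hits for a human moderator to look over."""
    return [(msg_id, ftp_phrases(body))
            for msg_id, body in articles
            if ftp_phrases(body)]
```

Something this crude produces plenty of noise, which is exactly why the human editor in the next step matters.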
A human editor pares these hits down somewhat to wipe out some noise and perhaps add additional information to each posting. Though the scheme to date has employed a single moderator at a time in this role, all that's lacking is volunteers: multiple people could each track a more specialized subset of the net and share their sense for important new additions to archives with the others. (Further discussion to comp.archives.admin.) I don't know that the scanning code has yet been released (Adam?).

For the limited problem space of "ftp'able files" the searching necessary is really quite minimal; for searches on fuzzier topics you'd probably want a system that combined a priori full text indexing of probably-relevant groups with regular scanning of usenet news and mailing lists for key words and phrases, which would have a lower hit rate but might lead you farther afield.

One interesting project that also focuses on searching through usenet is the "Pasadena" system; there will be a presentation on it by Mitch Wyle at SIGIR in Chicago in October. The "archie" service (see quiche.cs.mcgill.ca:/archie/doc/) is also worth noting, both in the context of providing before-the-fact searching for FTP retrieval and as a general tool for locating things around the net in knowbot-style fashion.

There's a great deal to be gained if what you're dealing with can be treated as something more than great amorphous gobs of information; the more explicit or implicit structure you can deduce in the information, the more focused and precise your searches can be. As an extreme example, the work of Malone at MIT ("Information Lens") to provide users with tools to create structured e-mail messages, along with tools to filter through their mail based on the added structured information, has proven to be very powerful.
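The structured-filtering idea is simple to convey in miniature. The field names and rule format below are hypothetical illustrations of the general approach, not Malone's actual design (I don't have a good citation for Information Lens; see the notes):

```python
# Structured messages as dicts of explicit fields rather than free text.
# Both the field names and the rule syntax are invented for illustration.
def matches(condition, message):
    """A condition is a dict of field -> required value; a message matches
    when every required field is present with that value."""
    return all(message.get(field) == value
               for field, value in condition.items())

def filter_mail(rules, messages):
    """File each message into the folder named by the first rule it
    satisfies; return (folder, message) pairs."""
    filed = []
    for message in messages:
        for rule in rules:
            if matches(rule["if"], message):
                filed.append((rule["file_into"], message))
                break
    return filed
```

The point is that matching on explicit fields is exact, where matching on raw prose is guesswork; which is why this works best in a community that agrees to fill the fields in.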
Unfortunately, netnews tends to be anything but a nice small homogeneous cooperating community, so those tools would have less success when unleashed on the daily torrent of netnews. One reaction to this flood of news is to try to locate sources which have an ear open to the wide world of what's out there and which can condense and edit it down to a reasonable stream of especially interesting materials. Sometimes you'll find volunteers up to the task, and you'll often see people forwarding particularly interesting mailing list or newsgroup materials to friends who don't have the time or energy (or tools) to plow through netnews.

Paper- (or email-)based electronic information sources have started to pop up; no doubt more are coming. For $xx-$xxx/year (the price of a good magazine, let's say, but less than an academic journal subscription) you get the services of an editorial staff, some on-line archives managed for the task, and a steady stream of interesting materials. I doubt that anyone will get rich producing one of them any time soon, but a well-edited and properly produced periodical should be able to support itself from readers of the net.

--
Edward Vielmetti, vice president for research, MSEN Inc.   emv@msen.com
MSEN, Inc.  628 Brooks  Ann Arbor MI 48103  +1 313 741 1120
for information on MSEN products and services contact info@msen.com

Notes.

There was a review of comp.archives in an earlier issue of Boardwatch. The masthead says that Boardwatch comes out 12 times a year. It gives these electronic mail addresses:

   Internet: jack.rickard@csn.org
   GEnie: JACK.RICKARD
   CompuServe: 71177,2310
   Fidonet: 104/555
   MCI Mail: 418-7112

WAIS can be had from quake.think.com:/pub/wais/. A (beta test) ascii terminal interface is running at hub.nnsc.nsf.net; log in as "wais".

The "Matrix News" materials are on quake.think.com:/pub/mids/matrix_news/. A subscription form is in the file "0subscribe".
   Matrix News
   Matrix Information & Directory Services
   701 Brazos, Suite 500
   Austin, TX 78701-3243 U.S.A.

Thanks to John Quarterman for this excellent periodical.

The bionet.general quotations were found in the WAIS server "biosci" running at genbank.bio.net.

I don't have a good citation for Information Lens.

comp.archives has not been written up extensively in any periodical (and that's probably my fault).