In-reply-to: ccoprmm@prism.gatech.EDU's message of 7 Sep 91 02:02:17 GMT
Newsgroups: comp.sources.wanted,comp.archives.admin,comp.org.eff.talk,bionet.general,comp.protocols.tcp-ip
Followup-To: comp.archives.admin
Subject: Re: knowbots: searches on inet sections
References: <36002@hydra.gatech.EDU>

In article <36002@hydra.gatech.EDU> ccoprmm@prism.gatech.EDU (Michael Mealling) writes:

   Has anyone been doing anything with a species of program called a
   "knowbot"?  Essentially they scout the net (news, archives, mail
   lists, etc) for articles that would be of interest to the user.  If
   you've read David Brin's _Earth_, they are referred to as "ferrets".
   (I kind of like the name "meme-hound").

There's a pile of discussion going on all over in assorted places about these things; here are some pointers to where the traffic is and who's talking. This is by its very nature a sketchy report, since the whole premise is that there are people of every field and persuasion looking for tools and techniques to find interesting articles, and so you're bound to find traffic and experts just about anywhere.

"Knowbot" is trademarked, not a generic term; take care how you use it. There's a service which calls itself the "Knowbot Information Service (V1.0) Copyright CNRI 1990 All Rights Reserved," which you can use by telnet to sol.bucknell.edu, port 185; it's a system that looks up user email addresses by name. More information about it can be found at nri.reston.va.us:/rdroms/*, including the text of an internet draft on the subject. (I'll have to check if and when this was published as an RFC.) KIS is reviewed in the June 1991 issue of _Boardwatch_ Magazine.
There's a short account by Vincent Mazzarella of a presentation by Mr. David Ely of CNRI (the Corporation for National Research Initiatives) to the ACR Workshop on Computer Networks in Radiology Research in bionet.general (March 1991) where he refers to systems like "Grateful Med" as a Knowbot:

   One can build a KNOWBOT locally and then send this to a larger
   machine, have the program run on that larger machine, returning
   processed data to the local system.  GRATEFUL MED is a good example
   of this concept in which a search routine is built at the local PC
   and is then sent to a large database for a search.

The first thing I turned to in preparing this was WAIS, the Wide Area Information Service. A search for "knowbot" in the "matrix_news" database on quake.think.com yielded a good article that references a meeting of the Coalition for Networked Information (CNI) working group on "Developing a Framework for Network Directories and Related Services", held June 3-4 at Stanford. This meeting brought together some of the interested parties; you can find an account of it via anonymous FTP in the file quake.think.com:/pub/mids/matrix_news/cni.3 or by using the WAIS source in quake.think.com:/pub/mids/matrix_news.src. Rather than quote bits of the article, it would be most useful to fetch the whole thing and read it.

WAIS itself falls into this category generally, since there's a division of labor between the search engines running on remote machines like Connection Machine supercomputers and the local display and user interface systems running in point-and-click (Mac) environments or as part of your editor (GNU Emacs).

comp.archives is an example of a real-life system which uses some of these tools. A set of programs systematically filters through days' worth of usenet news postings looking for key words and phrases that refer to materials available for anonymous FTP from remote systems.
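The scanning step can be sketched roughly like this. To be clear, the patterns and function names here are my own illustrative guesses at the flavor of the thing, not the actual comp.archives code:

```python
import re

# Phrases that typically signal an ftp'able announcement.  These particular
# patterns are illustrative guesses, not the ones comp.archives really uses.
PATTERNS = [
    re.compile(r"anonymous\s+ftp", re.IGNORECASE),
    re.compile(r"available\s+(?:for|via|by)\s+(?:anonymous\s+)?ftp", re.IGNORECASE),
    re.compile(r"\bftp\s+(?:to|from)\s+[\w.-]+", re.IGNORECASE),
]

def ftp_phrases(article_body):
    """Return every key phrase found in one article's text."""
    hits = []
    for pattern in PATTERNS:
        hits.extend(pattern.findall(article_body))
    return hits

def scan_spool(articles):
    """Filter a batch of (message_id, body) pairs down to candidate
    hits for a human moderator to look over."""
    return [(msg_id, ftp_phrases(body))
            for msg_id, body in articles
            if ftp_phrases(body)]
```

Something this crude produces plenty of noise, which is exactly why the human editor in the next step matters.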
A human editor pares these hits down somewhat to wipe out some noise and perhaps add additional information to each posting. Though the scheme to date has employed a single moderator at a time in this role, all that's lacking is volunteers: multiple people could each track a more specialized subset of the net and share their sense for important new additions to archives with the others. (Further discussion to comp.archives.admin.) I don't know that the scanning code has yet been released (Adam?).

For the limited problem space of "ftp'able files" the searching necessary is really quite minimal; for searches on fuzzier topics you'd probably want a system that combined a priori full text indexing of probably-relevant groups with regular scanning of usenet news and mailing lists for key words and phrases, which would have a lower hit rate but might lead you farther afield.

One interesting project that also focuses on searching through usenet is the "Pasadena" system; there will be a presentation on it by Mitch Wyle at SIGIR in Chicago in October. The "archie" service (see quiche.cs.mcgill.ca:/archie/doc/) is also worth noting, both in the context of providing before-the-fact searching for FTP retrieval and as a general tool for locating things around the net in knowbot-style fashion.

There's a great deal to be gained if what you're dealing with can be treated as something more than great amorphous gobs of information; the more explicit or implicit structure you can deduce in the information, the more focused and precise your searches can be. As an extreme example, the work of Malone at MIT ("Information Lens") to provide users with tools to create structured e-mail messages, along with tools to filter through their mail based on the added structured information, has proven to be very powerful.
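The structured-filtering idea is simple to convey in miniature. The field names and rule format below are hypothetical illustrations of the general approach, not Malone's actual design (I don't have a good citation for Information Lens; see the notes):

```python
# Structured messages as dicts of explicit fields rather than free text.
# Both the field names and the rule syntax are invented for illustration.
def matches(condition, message):
    """A condition is a dict of field -> required value; a message matches
    when every required field is present with that value."""
    return all(message.get(field) == value
               for field, value in condition.items())

def filter_mail(rules, messages):
    """File each message into the folder named by the first rule it
    satisfies; return (folder, message) pairs."""
    filed = []
    for message in messages:
        for rule in rules:
            if matches(rule["if"], message):
                filed.append((rule["file_into"], message))
                break
    return filed
```

The point is that matching on explicit fields is exact, where matching on raw prose is guesswork; which is why this works best in a community that agrees to fill the fields in.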
Unfortunately, netnews tends to be anything but a nice small homogeneous cooperating community, so those tools would have less success when unleashed on the daily torrent of netnews. One reaction to this flood of news is to try to locate sources which have an ear open to the wide world of what's out there and which can condense and edit it down to a reasonable stream of especially interesting materials. Sometimes you'll find volunteers up to the task, and you'll often see people forwarding particularly interesting mailing list or newsgroup materials to friends who don't have the time or energy (or tools) to plow through netnews.

Paper- (or email-)based electronic information sources have started to pop up; no doubt more are coming. For $xx-$xxx/year (the price of a good magazine, let's say, but less than an academic journal subscription) you get the services of an editorial staff, some on-line archives managed for the task, and a steady stream of interesting materials. I doubt that anyone will get rich producing one of them any time soon, but a well-edited and properly produced periodical should be able to support itself from readers of the net.

--
Edward Vielmetti, vice president for research, MSEN Inc.   emv@msen.com
MSEN, Inc.  628 Brooks  Ann Arbor MI 48103  +1 313 741 1120
for information on MSEN products and services contact info@msen.com

Notes.

There was a review of comp.archives in an earlier issue of Boardwatch. The masthead says that Boardwatch comes out 12 times a year. It gives these electronic mail addresses:

   Internet: jack.rickard@csn.org
   GEnie: JACK.RICKARD
   CompuServe: 71177,2310
   Fidonet: 104/555
   MCI Mail: 418-7112

WAIS can be had from quake.think.com:/pub/wais/. A (beta test) ascii terminal interface is running at hub.nnsc.nsf.net; log in as "wais".

The "Matrix News" materials are on quake.think.com:/pub/mids/matrix_news/. A subscription form is in the file "0subscribe".
   Matrix News
   Matrix Information & Directory Services
   701 Brazos, Suite 500
   Austin, TX 78701-3243 U.S.A.

Thanks to John Quarterman for this excellent periodical.

The bionet.general quotations were found in the WAIS server "biosci" running at genbank.bio.net.

I don't have a good citation for Information Lens.

comp.archives has not been written up extensively in any periodical (and that's probably my fault).