Michael J. Miller: Help! I'm Drowning in Data!
PC Magazine -- May 16, 1995


On-line search engines will help us find the most relevant information on the Internet.

Have you ever tried looking for a specific bit of information on any of the on-line services or on the Internet? If your answer is yes, then you've undoubtedly experienced the frustration of knowing that what you want is available, but buried. Either you don't know how to find it, or you're sick and tired of having to dig through piles of data just to get one helpful piece of intelligence.

Finding what you want on-line isn't going to get any easier. In the coming years, more information will be available on-line. Just look at how quickly World-Wide Web sites are proliferating. But take heart: Help may be on the way.

EDITORS TO THE RESCUE

How are you going to sort through all of this data? Maybe you won't have to. If you read my column regularly, you know that I believe most people will get the bulk of their on-line information through filters or editors, much in the way we now receive information filtered through television, newspapers, and magazines.

In the future, however, you'll be able to customize filters and other on-line editors. Already, PED Software's Journalist program lets you create a customized "electronic newspaper" filled with information you've told the program to pull down from CompuServe or Prodigy. Even better is Personal Journal, a new and customizable electronic version of The Wall Street Journal, which lets you pick which companies you want to track and which columns you want to read.

These two products are great, but they're just a first step. There will still be times when you're going to want some information that you haven't predefined or that doesn't come in through your regular channels. That's when you'll need to grab a shovel and once again personally delve into collections of data.

In "Publish Without Paper!" (February 7, 1995), we looked at a number of electronic publishing tools, ranging from hypertext authoring tools designed for use on a LAN to HTML tools for publishing information on the Web. Now we're beginning to see these tools combined to form search engines designed for managing large collections of articles or other data on the Web.

SEARCHING THE WEB

On the Web, search tools grow out of WAIS (Wide-Area Information Server), which is now available in both freeware and commercial versions. WAIS typically runs on a Unix server, but we're beginning to see some versions designed for Windows NT servers. Thus far, it's been aimed mostly at public sites, not commercial ones, but that's changing.

What we need is a tool to help us determine which data is really relevant to what we're looking for. Why? Because the problem with a large collection of documents is that it's too easy to find too many answers to any query you can come up with. Search engines are now being designed to go beyond simple, broad keyword searches.

One of the best approaches is used by Topic, from Verity. Topic has been a popular search engine used in a variety of products, such as Lotus Notes 3.0 and Adobe Acrobat 2.0. It uses both keywords and information searching to come up with a ranking of how relevant each document is to what you are searching for. You might get back a list of a hundred documents that match your criteria, but they would be listed in order of the relevance that Topic assigns. (For a demo, go to http://www.verity.com.)
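To make the idea of relevance ranking concrete, here is a minimal sketch in Python. It is not Verity's algorithm, which the column doesn't describe in detail. It assumes a plain term-frequency score weighted by how rare each query word is across the collection, and the names rank_by_relevance and tokenize are hypothetical, invented for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_relevance(query, documents):
    """Return (score, doc_id) pairs for matching documents, best first.

    Assumption: relevance is term frequency times a smoothed inverse
    document frequency. Commercial engines use richer evidence.
    """
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(doc_tokens)

    # Count how many documents contain each word at least once.
    doc_freq = Counter()
    for tokens in doc_tokens.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    results = []
    for doc_id, tokens in doc_tokens.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += tf * idf
        if score > 0:
            results.append((score, doc_id))

    # Highest score first; ties are broken by document id for stable output.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results


docs = {
    "a": "web search engines rank web documents",
    "b": "search for recipes",
    "c": "gardening tips",
}
for score, doc_id in rank_by_relevance("web search", docs):
    print(f"{doc_id}: {score:.3f}")
```

Every document that matches comes back, but the one that says the most about the query lands at the top of the list, which is the behavior described above.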

A different approach is offered by Architext, which uses what it calls context-based searching. Architext lets you enter a query and then comes up with the data you want based on the content of the documents themselves. It tries to figure out the content of the documents based on the context that the words are in. The result is that Architext's system will find stories that don't have any of the words in your search, but that do have the same general meaning. For instance, you might look for "presidents" and find an article that references "Bush and Clinton." In theory, it won't find things with the right words but the wrong meaning. This engine probably won't be right for everyone, but its approach is certainly interesting. (Check out the demo at http://home.mcom.com/MCOM/search_docs/index.html.)
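One way to picture how a search could match on meaning without matching on words is query expansion by co-occurrence. The Python sketch below is only an illustration under that assumption and is not Architext's actual method. It learns which words tend to appear together in the collection, then lets those related words count (at half weight) toward a match. The function names, the tiny stopword list, and the sample documents are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# A deliberately tiny stopword list, assumed for this example only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "were", "both"}


def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_cooccurrence(documents):
    """Count how often each pair of words appears in the same document."""
    co = defaultdict(Counter)
    for text in documents.values():
        terms = set(tokenize(text))
        for term in terms:
            for other in terms:
                if other != term:
                    co[term][other] += 1
    return co


def context_search(query, documents, related_per_term=2):
    """Score documents on query words plus their most frequent companions."""
    co = build_cooccurrence(documents)
    weights = Counter()
    for term in tokenize(query):
        weights[term] += 1.0
        # Pick the top companions deterministically: by count, then name.
        related = sorted(co[term].items(), key=lambda kv: (-kv[1], kv[0]))
        for other, _count in related[:related_per_term]:
            weights[other] += 0.5

    scored = []
    for doc_id, text in documents.items():
        terms = set(tokenize(text))
        score = sum(w for t, w in weights.items() if t in terms)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


docs = {
    "history": "Bush and Clinton were both presidents",
    "election": "Presidents Bush and Clinton debated the budget",
    "summit": "Bush and Clinton met in Washington",
    "garden": "Pruning roses in spring",
}
for score, doc_id in context_search("presidents", docs):
    print(doc_id, score)
```

In this toy collection, a search for "presidents" also returns the "summit" story, which never uses that word, because "Bush" and "Clinton" keep company with it elsewhere. A real system would need far more text, and far better statistics, to avoid false matches.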

Oracle wants to be the big database for documents and multimedia elements on the Internet, and it already has a variety of products available. The most compelling is Context, which can go through a variety of documents and create its own summary, pulling about three key sentences from each of the documents it selects. (You can check out a demo at http://www.oracle.com.)
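Pulling a few key sentences out of a document can be approximated with a simple word-frequency heuristic. The Python sketch below is an assumption made for illustration and is not how Oracle's Context works internally. It scores each sentence by the average frequency of its longer words and keeps the top three in their original order. The function name summarize and the sample text are hypothetical.

```python
import re
from collections import Counter


def content_words(text):
    """Return lowercase words longer than three letters (a crude filter)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3]


def summarize(text, max_sentences=3):
    """Pick the highest-scoring sentences, returned in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        if not words:
            return 0.0
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words)

    # Rank sentence positions by score, breaking ties by earlier position.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]


sample = (
    "Search engines index documents. "
    "Search engines rank documents by relevance. "
    "The weather was nice. "
    "Documents are ranked so users find relevant documents. "
    "I had lunch."
)
for sentence in summarize(sample):
    print(sentence)
```

The sentences about the weather and lunch drop out because their words appear nowhere else in the text. Real summarizers weigh much more than word counts, but the principle of keeping the sentences that best represent the whole is the same.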

There are other approaches to letting you search for data on the Web or on other wide-area networks. Personal Library Software now has an Internet version of its Personal Librarian called PLServer. (A demo is available at http://www.pls.com.)

When it comes to organizing your information, Lotus isn't about to be left out. The company is developing a group of products, collectively known as InterNotes, that will let you access Internet newsgroups from within Lotus Notes.

InterNotes products will also let you turn Notes databases into HTML documents that can be published on the Web. (For more info, see the sidebar "Lotus Ties Notes to the Internet" in this issue's story "The Internet Means Business.")

Folio Corp., which makes Folio Views--our Editors' Choice among LAN-based hypertext publishing tools--is getting into the act with an Internet version of Folio Views that lets you publish your "infobases" on the Internet and search those documents with relevance-based search tools. This product is oriented more toward corporate databases than toward commercial publishers. Folio Views does, however, show us how much easier it's going to be for people to become publishers of large amounts of data. (Take a look at http://www.folio.com.)

Folio has developed a relationship with the Copyright Clearance Center (which represents most major publishers), so that companies that want to place documents or articles on a corporate server will be able to let employees search for those documents legally.

The Folio and Lotus products both seem aimed primarily at companies that want to publish corporate data for their own customers, while many of the other products seem aimed at people who want to publish a lot of information for the general public.

Turning our PCs into intelligent information appliances capable of finding the data we need will involve combining the features of many of these products--from hypertext systems and indexing programs to electronic-document packages, browsers and communications packages, intelligent agents, and multimedia databases. We have a long way to go, but getting there is crucial if we're going to find that one piece of information in the huge collection of data that will be available.


Full Text COPYRIGHT Ziff-Davis Publishing Company 1995