WebIndex

Introduction

There is a major need for expert-mediated virtual libraries (VLs) of well-selected and well-described links to scholarly and educational resources. VLs are one important component in providing for the Internet resource-finding needs of the academic community. Generalized commercial Web search engines, and even second-generation engines such as Google, are often unable to produce consistently relevant results given their generalized focus, the immense amount of territory they cover, and the great number of audiences they serve (Chakrabarti et al., 1999a and 1999b; McCallum et al., 2000). Other problems include a lack of intelligible descriptive information in results displays, biased presentation of search results (since site placement can be purchased), distracting messages and advertisements, and the possibility that these engines will become fee-based in the future at costs that academia might not be able to bear.

Expert-mediated, objective finding tools that describe and provide well-organized, uniform, integrated access are needed because they make important scholarly and educational resources visible and useful to researchers and students. The response to this need can be seen in the development over the last few years of digital libraries of categorized links (i.e., Web directories, indexes, portals) of all sorts, including the directories that, as of the last couple of years, augment and are produced by the general search engines themselves, such as AltaVista. We believe that in providing a balanced effort positioned between labor-intensive MARC cataloging and traditional library catalog approaches (1 to 1½ hours per record created) and the generalized minimal indexing approaches of digital libraries such as Yahoo! (1½ minutes per record) and the general search engines, academic digital libraries such as WEBINDEX (25 minutes per record) have been providing an important finding tool service. This service can be characterized, in most cases, by expert-selected and expert-described resources that adhere to various standards (new and traditional) in academic information organization and retrieval, while emphasizing efficient and streamlined approaches to content building.

The Web will continue to grow at an ever-accelerating pace. At current levels of growth, the summed content of the community of academic digital libraries will therefore represent an increasingly smaller portion of the total number of worthwhile Internet resources. Obviously, there are many challenges that digital library managers now face and must meet in order to sustain themselves while providing comprehensive, or even representative, coverage of useful research and educational resources.
The discussion below points out some of these challenges and outlines some possible solutions including those that our VL, WEBINDEX, is currently pursuing.

Discussion

WEBINDEX Description

WEBINDEX is currently (as of April 2000) a collection of close to 20,000 librarian-selected and described scholarly and educational Internet resources. It was one of the very first library-originated, Web-based information services of any kind. WEBINDEX has been created by University of California, California State University, and other librarians working together in a cooperative effort. WEBINDEX is funded by the Library of the University of California, Riverside, the Fund for the Improvement of Post-Secondary Education (FIPSE, U.S. Department of Education), and the U.S. Institute for Museum and Library Services (IMLS National Leadership Grant). WEBINDEX is easy to use and was designed to accommodate multiple skill levels among users. Sophisticated queries and browsing are supported. Most major disciplines are well covered via access to important databases, e-journals, e-texts, and digital collections, among others. Our funding from FIPSE and IMLS is for research in developing software, labor efficiency improvements, and cooperative organizational solutions for challenges which WEBINDEX and, by extension, the digital library community at large are facing.

The Need for Library-based Academic Digital Libraries

There are many pressing needs for library-based, academic VL-type finding tools. Driven by library users' information needs, libraries have increasingly been augmenting print-source indexing and cataloging with efforts in Internet resource description and organization. The skills transfer easily: organizing important information for scholarly and educational uses is our mission and is what we have traditionally done as librarians. The proof is in the pudding: academic, library-based digital libraries of various sizes have begun to proliferate by the hundreds. Librarians, albeit in a generally disorganized fashion, are finding the time and see this as an important expenditure of effort for their users. An incentive apparently exists here, as does a general approach or model. Librarians are a natural group to organize in the cause of serious VL building. The challenge now is to focus this energy.

Academic and educational community needs in Internet finding tools differ from those of most generalized search engine user communities. Academic library VLs proliferate because generalized search engines no longer scale, not so much in regard to Internet coverage, but rather in the sense that, given their multiple-audience focus, they cannot consistently provide serious researchers with appropriate results. Moreover, the goal of their efforts is naturally to return maximum profit to their owners, and this may in the long run be inconsistent with the needs of academic institutions for constant, high-usage access to finding tools that offer affordable, objectively presented, and well-described content. Just as academics have required expert-mediated, uniform, and objective practices in organizing print information, and have thus organized and provided institutional support for libraries to provide access to information above and beyond the open book market, there is now a similar crucial institutional need for library-based efforts, approaches, and standards (modified appropriately) in organizing academically relevant Internet information. In serving our community of users, academic libraries may very well become the natural home and energy source behind organized, concerted, high-quality, and properly funded academic Internet finding tools. Our project and a number of others indicate that new views of the roles of libraries and librarians are taking form. So envisioned, how do we, as librarians providing VLs, move from today’s uncertainties to a more sustainable position?

Challenges and New Directions

Note that, while the following are some directions for the VL community to explore, there is no single “right” solution. The key to VL sustainability lies in the creative exploration and application of flexible solutions in many difficult areas by many different projects. There are, in addition, numerous other challenges and directions not mentioned here. Many of these solutions have been explored, discussed, and even implemented in small ways, in fits and starts, over the last couple of years.

Direction A: Improving Cooperative Organization and Aggregation Effort Through:

1. Interoperability Among Multiple Tools
Many organizations which provide VL services are, in order to scale, increasingly interested in boosting their coverage through interoperating with (e.g., passing searches among) other services. This is a useful effort and one which WEBINDEX is following. The interoperability standards and conventions that should come from efforts such as IMesh, Isaac Network, and ROADS should be important to all VL service providers. Cooperating partners in such projects are concerned with offering reasonably seamless searching of what are often very heterogeneous VL resources. Approaches here include building a common interface for many distributed resources and/or allowing each partner to retain its own interface through which its resources (usually primarily) and those of others (usually secondarily) are queried. Work that leads to passing queries and to developing similar field structures and standards in resource description is inherent here. Perhaps more importantly, this kind of interoperability is the simplest means of initiating cooperation among VLs. This direction is, organizationally, a useful and strategic first step which allows for the retention of institutional identity and “ownership” while initiating substantial cooperative effort. Such work may eventually lead to more intense forms of cooperation which may yield even better tools.
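
To make the query-passing idea concrete, the sketch below fans a single user search out to several cooperating VL services and merges the returned records into one set, tagging each record with its source so institutional identity is retained. It is a minimal illustration only: the partner URLs, the "q" query parameter, and the JSON record format are assumptions of ours, not the actual conventions of IMesh, Isaac Network, or ROADS.

    # Minimal sketch of query passing among cooperating VL services.
    # The partner URLs, the "q" parameter, and the JSON response shape
    # are hypothetical; real interoperability efforts define their own
    # protocols and field structures.
    import json
    import urllib.parse
    import urllib.request

    PARTNER_SERVICES = [
        "https://vl-one.example.edu/search",    # hypothetical partner
        "https://vl-two.example.edu/search",    # hypothetical partner
    ]

    def pass_query(query):
        """Send the same query to each partner and merge the record lists."""
        merged = []
        for base in PARTNER_SERVICES:
            url = base + "?" + urllib.parse.urlencode({"q": query})
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    records = json.load(resp)   # assume a JSON list of records
            except OSError:
                continue                        # skip partners that are down
            for rec in records:
                rec["source"] = base            # retain institutional identity
                merged.append(rec)
        return merged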

2. Cooperation in a Single Open Tool
Some of the problems and promises of interoperability would be equally present here. However, some problems may be solved by aggregating effort in a single open, cooperatively built tool that has its own presence and runs as a separate organization. Given that interoperability and cooperation overlap greatly, they might even be seen as different stages of the same organizational effort: a loose confederated effort leading, as appropriate, into a more consolidated and focused effort. Multiple entities can be involved but the service provided by the tool would be managed by a cooperative organization that existed separately from its constituents.

A single cooperative tool and organization might provide for greater efficiencies and improved features in a number of areas including:
o Greater uniformity and perhaps quality in metadata development, interface, and other system features;
o Speedier, more responsive systems development, and new technology implementation generally;
o Better query and system response times through a centralized, non-distributed system architecture that wouldn’t be hampered by accommodating the lowest common denominator or least developed/slowest participants;
o Quicker decision-making through a less heterogeneous organization;
o Elimination of redundancy in effort in not only content building but systems development as well;
o Greater success in marshalling more resources around a single, well-defined funding target;
o Pooling resources and effort around a single mutually owned and cooperatively created tool; and,
o Supporting institutional identity and “ownership” through custom interface work to create appropriate look and feel.

3. National NetGain
WEBINDEX is beginning to work with selected college and university libraries to methodically and organically develop a national cooperative network of content and system builders. This effort is known as National NetGain. NetGain is an effort with cooperating partners, but one which is focused around building a single tool (as described above). We believe that there is a need for a number of national-level, coordinated efforts such as WEBINDEX. Our efforts will benefit the library community and all participants: contents entered are owned by their authors and/or the organizations with which the authors are affiliated, as well as by the central NetGain project. System software will be co-developed by participants and placed in the public domain. User interface options are flexible and would meet the needs of cooperating organizations regarding functionality and “look and feel”. NetGain could help save resources among libraries which are now spending a great amount of energy to create frequently redundant content. Together we could do what we do better than we can by continuing to work individually, and save significant resources as a result. If you or your institution are interested in NetGain, contact us.

Direction B: Improving Systems, Open Software, and Tiered Levels of Resource Description

Providing greater resource coverage while at the same time retaining quality in resource selection and description is crucial to sustainability. Achieving a good balance between respectable finding tool “reach” (typical of the larger Web search engines) and metadata quality (typical of the academic VLs) is a challenge involving technological as well as labor and other economic concerns. Part of the answer lies in developing “smart” systems that are really hybrids, amalgamating expert-based VL approaches with machine-learning-based crawling and classification systems. Such a hybrid system will necessitate flexibly interweaving expert effort with machine assistance in the major labor- and time-consuming tasks involved in finding tool collection development, resource description, and collection maintenance. Also involved will be the development of an approach to flexibly allocating expensive human expertise in resource discovery and description in a tiered way, according to the information value of the resource to its audience. This hybrid system would combine the best of both worlds: the expertise of the librarian with the reach and assistance of the focused, smart crawling and classification system.

1. Hybrid Digital Library/Smart Crawling and Classification Systems
Major advancements are being made in machine-learning-based “smart”, focused crawling, and automatic resource description and classification software. These technologies will be of great use in academic Internet finding tools and VLs. While some VLs (e.g., Social Science Information Gateway) do gainfully employ relatively simple crawling and auto-classification technology (e.g., Harvest-NG), there are many ways in which this work could be significantly enhanced. Recent literature reviewing machine learning in these and closely related applications is available (Paepcke, 1998; Chakrabarti et al., 1999b; Glickman and Jones, 1999; McCallum, 2000). A number of tools have incorporated these approaches, including Google, Cora, and the New Zealand Digital Library, among others. Much of this software is in the public domain.

Crawlers are starting to do their work in much more efficient, focused, and accurate ways (as opposed to undirected, shotgun crawling using simple filters), employing, among other techniques, reinforcement learning (McCallum, 1999a and 1999b; Rennie and McCallum, 1999). Smart crawling systems can, with increasing accuracy, find and harvest higher quality resources than previously by focusing on and working within what are essentially self-defining Internet communities (think of citation analysis via Science Citation Index), such as many academic disciplines (Chakrabarti et al., 1999b; McCallum et al., 2000, 1999a, 1999b; Brin and Page, 1998; Gibson, 1998; Kleinberg, 1998; Chang, in press). Classifiers can now be “trained”, with increasing effectiveness (depending on training data, information/document types crawled, and classification scheme complexity), to automatically recognize and categorize potentially high quality documents and sites using smaller and less well labeled training data (including bibliographic records) than ever before (Craven, 1999; R. Jones, 1999; Riloff and Lorenzen, 1999; Nevill-Manning, 1999; Dumais, 1998; Hofmann, 1999 and 1998). Techniques featuring multiple document and site content evaluation schemes, each correcting for the deficiencies of the others, are promising (Freitag, 1998; Cleary and Trigg, 1998). We have been working with approaches that combine document (and site) similarity analysis (McCallum et al., 2000), linkage analysis (see Kleinberg, 1998; Chakrabarti, 1999a), and a wide array of statistical truing techniques. These approaches, as judged by the finding tools mentioned in the preceding paragraph, have started to produce increasingly effective, accurate, and high quality crawling and automatic classification systems. They represent a major watershed in Internet finding tool advancement, and one of which the academic VL community needs to avail itself.
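
The sketch below illustrates the basic mechanics common to such focused crawlers: candidate links wait in a priority queue, a trained topic classifier scores each fetched page, and promising pages both enter the collection and boost the priority of their outgoing links. It is a simplified stand-in for the cited systems; fetch_page(), extract_links(), the classifier, and the 0.5 threshold are hypothetical placeholders, not any published implementation, and a reinforcement learner would replace the crude parent-score heuristic with an estimate of future reward.

    # Sketch of a focused crawler: candidate links are scored and visited
    # best-first rather than in undirected, shotgun order. The helpers
    # passed in (fetch_page, extract_links, classifier) are hypothetical.
    import heapq

    def focused_crawl(seed_urls, classifier, fetch_page, extract_links,
                      limit=1000):
        frontier = [(-1.0, url) for url in seed_urls]   # max-heap via negation
        heapq.heapify(frontier)
        seen, harvested = set(seed_urls), []
        while frontier and len(harvested) < limit:
            _neg_score, url = heapq.heappop(frontier)
            page = fetch_page(url)
            if page is None:
                continue
            relevance = classifier(page)        # estimated P(page is on-topic)
            if relevance > 0.5:                 # assumed harvest threshold
                harvested.append((url, relevance))
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    # Enqueue links in proportion to the parent's relevance;
                    # a reinforcement learner would estimate future reward here.
                    heapq.heappush(frontier, (-relevance, link))
        return harvested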

Still there is great room for improvement in machine learning applications in these areas. We view these techniques as providing assistance which will boost time and labor savings and will amplify, not replace, expert effort in content building. The subject and resource description expertise of librarians remains at the core of our approach though it will be greatly augmented through “smart”, machine-assisted collection development (crawling), resource description (classification and indexing), and machine-assisted URL and collection maintenance (content change checking/fixing) software. New roles for software that assists VL experts are being developed as, conversely, new roles for experts working with this class of software are being developed (see below).

What a user will see and benefit from in such hybrid systems are multiple types of records, more in-depth indexing, new approaches to search/browse access, and new approaches to displaying records according to different relevance-to-query ratings. More importantly, the records contained should be, on the whole, of a much higher quality and much more relevant to researchers and students than those typically seen in the large Web engines. And, crucially, there will be many more of them than typically found in standard VLs. In our case, we will continue to feature tens of thousands of expert-created records while augmenting these with millions of crawling system created records. Both expert and crawling system created records will feature fielded indexing as well as near full-text indexing and retrieval. Expert records will be of an increasingly important, general, and/or reference-oriented nature, while the crawling system records will provide the critical mass of information specificity necessary to enable the detailed searching often absent in VLs.
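
One way to picture such dual indexing is sketched below: a single record carries expert-assigned fields alongside machine-extracted ones, and two simple inverted indexes support fielded and near full-text retrieval respectively. The record shape and field names are illustrative assumptions of ours, not a production schema.

    # Sketch of one record shape serving both tiers: expert-supplied fields
    # sit beside machine-extracted ones, and simple inverted indexes cover
    # both fielded and near full-text retrieval. Field names are illustrative.
    from collections import defaultdict

    record = {
        "id": "r-00042",                          # hypothetical identifier
        "title": "Introduction to Plant Genetics",
        "subjects": ["Genetics", "Botany"],       # expert-assigned (Tier 2)
        "keywords": ["genome", "arabidopsis"],    # machine-extracted (Tier 1)
        "text": "full text of the resource would be stored or fetched here",
    }

    field_index = defaultdict(set)    # (field, term) -> record ids
    text_index = defaultdict(set)     # term -> record ids

    def index_record(rec):
        for field in ("title", "subjects", "keywords"):
            values = rec[field] if isinstance(rec[field], list) else [rec[field]]
            for value in values:
                for term in value.lower().split():
                    field_index[(field, term)].add(rec["id"])
        for term in rec["text"].lower().split():
            text_index[term].add(rec["id"])

    index_record(record)
    print(field_index[("subjects", "genetics")])   # fielded retrieval
    print(text_index["resource"])                  # near full-text retrieval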

2. Open Source Software Development
Open source software (licensed under the GNU GPL) is being created that will be of use to the VL community. Our hope is that we can, in a more specialized area, contribute in a small way to digital library system development in the same way that the creators of Linux and Apache have contributed to operating system and Web server software advancement. If co-development interests you, contact us.

3. Resource Description in a Hybrid System: Three Tiers of Record Quality and Expert Labor Input
Much of our concern with the human side of the hybrid system is in developing multiple tiers of labor expenditure in indexing and description that match the quality of the resource. How much human metadata creation effort, beyond basic machine crawling, automated classification, and near full-text indexing, needs to be applied? The answer depends on one's scaling and sustainability strategies, on questions of adequate coverage vs. adequate resource description, and, ultimately, on labor costs. The shorter and quicker the description, the more labor can be expended on increasing coverage, and vice versa. The time issue is very much involved here. Our goals have been indexing, not cataloging, and adequate, objective description, not reviewing. Within our approach, though, we have always recognized that the higher the scholarly or educational value of a resource, the greater the amount of expert time which can be invested in its description.

We are developing a three tiered approach to resource description, building on new system capabilities, in order to achieve better efficiencies in labor allocation. These tiers are:
o Automatically indexed, minimal records for medium to high value resources;
o This plus expert review and augmentation for high value resources;
o These plus exporting records for very high value resources to allied, traditional library cataloging operations.

Using our streamlined record as a foundation, MARC cataloging could be done and the more fully embellished record then moved into traditional library catalogs and/or even back to WEBINDEX. All tiers would benefit from near full-text indexing (the exception would be the library catalogs importing records from us).

Scenarios for the evolution of a resource’s description in our hybrid system might include:
o a.) A resource is discovered by our crawler/classifier as having value and is either:
 a1.) Tier 1: Automatically classified (filling fields with data from title words and phrases, URL, author supplied metadata as available, author-emphasized text, significant keywords and phrases from full-text, general subjects from our subject tree) or,
 a2.) Tier 2: In the case that the crawler/classifier determines very high linkage and similarity ratings (in comparison with known high quality resources), an expert would be notified to immediately review and enrich the initial auto-indexing (the auto-supplied information would be clarified and checked for accuracy while complex descriptive information such as an annotation or Library of Congress Subject Headings would be supplied). In addition, over time, as usage software detected patterns of high usage of a record for a resource not yet expert-reviewed, it would be flagged for review;
 a3.) Tier 3: If the resource, now with expert created metadata, experiences even higher usage, it would again be flagged for expert review and, if determined to be of extremely high value and significance to warrant fuller description, could be copied to participating or allied cataloging departments (we wouldn’t do it) for cataloging and from there moved into library catalogs and back to WEBINDEX.
This implies that WEBINDEX could import/export and generally “trade” records with traditional catalogs doing Web resource description. Reaching Tier 3 effort would not be common but would apply in cases where the resource, for example, was a mainstream A&I database, e-journal, digital collection, or digital library, etc. Note that the process would be flexible and wouldn’t necessarily start with step a.). Experts would continue to introduce resources into the system unaided by the crawler. Some resources would be so highly rated by the system that they might be moved to expert review (a2.) immediately.
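
A rough sketch of how these escalation rules might be expressed in software follows. The thresholds, field names, and usage signal are invented for illustration; in practice they would be tuned against real linkage, similarity, and usage data.

    # Sketch of the tiered escalation logic described above. The thresholds,
    # field names, and usage figures are hypothetical illustrations of the
    # rules, not production values.

    def assign_tier(record):
        """Route a record to a description tier from system ratings and usage."""
        rating = record["system_rating"]    # combined linkage/similarity, 0..1
        usage = record["weekly_accesses"]   # from usage-pattern software

        if record.get("tier") == 2 and usage > 500:
            # a3.): an expert-described record with very high usage is
            # flagged for possible export to allied MARC cataloging.
            return 3
        if rating > 0.9 or (record.get("tier") == 1 and usage > 100):
            # a2.): a very high rating, or high usage of a not-yet-reviewed
            # record, triggers expert review and enrichment.
            return 2
        # a1.): auto-classified minimal record; no expert labor yet.
        return 1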

Over the next two to four years the ratio of different types of records might be, as a guesstimate to illustrate scale among the three tiers, something like several million records at the Tier 1 level of crawler/auto-classified records (minimal expert labor, e.g., truing crawls); between 100,000 and 300,000 records that had merited Tier 2 effort (expert labor at the level of streamlined indexing); and under 150,000 that had justified labor at the level of Tier 3 (expert labor at the level of MARC cataloging).

Our choices, as part of our scaling strategy, naturally emphasize auto-classification as well as streamlined indexing because expert time and labor are expensive. The Web, as a publishing medium for scholarly and educational resources, remains a spectacularly growing mass medium; the number of resources that will be of use is going to be immense. Immense amounts of intelligent and useful information, both that which is brokered and vetted by scholarly organizations and publishers and that which is not, are becoming available. So, given these trends and the fact that many academic libraries are currently not handling print cataloging backlogs well, we need to become comfortable with new cooperative approaches towards creating effective, minimal metadata. Metadata records need to be created in increasingly automated, streamlined, and time-efficient ways. Metadata will be created through indexing, not cataloging, approaches, except in rare instances. It will be geared towards providing access and finding (and refinding) information. Metadata will remain simple and capable of being authored by academics and others not trained in cataloging. It will be produced by, and/or will complement, new machine-assisted means of providing metadata.

Direction C: Developing Labor Savings in VLs and New Roles for Expert Content Builders
Though implied directly or indirectly throughout the discussion, labor savings deserves its own category of concern. VLs, like physical libraries, are not inexpensive to build and maintain. While we believe academic VLs will become crucial in modern scholarly research and education and will have an expanding role among the array of services which serious library systems routinely provide, it is vital that production and maintenance costs be lowered.

Our approaches to machine assistance, then, are not just about increasing the reach and quality of our tool. They are about more efficient, labor-saving ways of doing this. Machine assistance which results in greater efficiencies in the labor-intensive VL tasks is critical. Also critical is refocusing expertise so as to make the best use of machine assistance.
How do we take the best of machine learning in our area of application and combine it with human expertise and effort in optimum ways? We see system optimization gains via defining new roles (in addition to continuing with many of the more established roles) for experts in a number of areas. These might include:
• Machine-assisted collection development
Developing core, high value initial crawling domains; Truing crawls, e.g., through lifting (Chang, in press) and focusing (Chakrabarti, 1999a);
• Machine-assisted resource description
Completing and/or editing base records created through machine assistance; Developing training data for, and expanding and truing, machine classification systems; Significantly augmenting high value, machine-collected and machine-described resources with in-depth expert collecting and description; and,
• Machine-assisted collection maintenance
Re-finding working URLs and updating content description when records have been flagged as having changed in these areas.

WebIndex

WEBINDEX is a unique Web resource featuring well organized access to a substantial number of important research and educational tools on the Internet. WEBINDEX is notable for its collection of close to five thousand annotated and indexed records with links to selected, university level resources in most major academic disciplines. Information in WEBINDEX is easy to find given the multiplicity of access points provided. It has received, on average, over twenty thousand accesses per week during the last 15 months. In addition, WEBINDEX provides simple and streamlined capabilities for adding resources (e.g., knowledge of HTML, the Hypertext Markup Language, is not required) as well as essential maintenance functions (e.g., a URL Checker ensures that links are good).

Most of its important features and services, described in-depth below, remain unique among Internet resource collections.

The WEBINDEX Collection

WEBINDEX contains close to five thousand records. Among these are substantive databases, guides to the Internet and other electronic resources for most disciplines, textbooks, conference proceedings, and journals, as described in our Welcome document. The Biological Sciences WEBINDEX alone provides interactive access to close to three hundred databases.

Separate virtual collections or WEBINDEXs exist in most major areas of university level research and educational interests. These include collections in: a) Biological, Agricultural and Medical Resources; b) Government Information Resources; c) Social Sciences and Humanities (included here as well are General Reference, Business, Education, and Library/Information Studies related resources); d) Physical Sciences, Engineering, Computer Science and Mathematics; e) Internet Enabling Tools (e.g., help, tutorials, navigators of assistance in Internet usage); f) Maps and Geographic Information Systems; g) Visual and Performing Arts; and, h) Instructional Resources (in all disciplines) on the Internet. These are listed in order of comprehensiveness. The first two collections contain seventeen hundred and fifteen hundred resources respectively. The others are strong and compare favorably with collections at other large university systems surveyed. All files are growing rapidly, reflecting the growth of the Internet, and represent well organized collections of selected, high-quality resources.

The WEBINDEX System

WEBINDEX for Users. Among the contributions of WEBINDEX is the essential enrichment or “value added” service, as mentioned, of providing annotations as well as in-depth indexing terminology for each record. This greatly helps faculty and students to quickly retrieve a focused results set, examine the relevance of individual records, and then choose among them prior to accessing, thus saving considerable time.

In addition to providing search capabilities on conventional subject, key word and title terms, WEBINDEX allows retrieval on terms such as “comprehensive”, “reference resources”, “subject guides”, “virtual libraries” and “searchable databases”, among other terms, which indicate the depth and scope and/or pathfinding nature of specific resources. These qualities and others are often indicated as well in the annotation. Also emphasized is the inclusion of important subject guides to Internet as well as to other electronic and print format resources in most major disciplines. Concerns regarding resource comprehensiveness, quality and general usefulness from an academic perspective guide all WEBINDEX resource selection activities.

WEBINDEX provides a great number of access points, as exemplified in the Biological, Agricultural and Medical resource, through both browse (What’s New, Title, Table of Contents, Subject, Key Word, hyperlinked indexing) and search (title, subject and key word) modes. Search mode allows the user to quickly retrieve records from the collection on the chosen subject(s). Nested, Boolean searching capabilities are featured. Search results come back in the form of dynamically created Web pages, which are custom HTML documents created on-the-fly and containing the results set unique to each search. By way of contrast, many conventional Web virtual libraries provide minimal searching and/or depend on displays of static lists of resources. In WEBINDEX, each results set record, in addition, features indexing terms that are viewable in hyperlink form and, when clicked upon, allow further broadening or narrowing of the search as desired. This feature is very useful.
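
The general shape of such searching can be sketched as follows: a nested Boolean query is evaluated against the record set, and the results page is generated on-the-fly with each indexing term rendered as a hyperlink that re-runs the search. The function names, field names, and HTML shape are illustrative assumptions, not the actual WEBINDEX code.

    # Sketch of nested Boolean retrieval with results pages built on-the-fly.
    # Record fields and the /search URL are illustrative only.
    import html
    import urllib.parse

    def search(records, query):
        """query is nested, e.g. ("AND", "maps", ("OR", "gis", "geography"))."""
        if isinstance(query, str):
            term = query.lower()
            return {r["id"] for r in records
                    if term in (r["title"] + " " + " ".join(r["subjects"])).lower()}
        op, *operands = query
        sets = [search(records, q) for q in operands]
        return set.intersection(*sets) if op == "AND" else set.union(*sets)

    def render_results(records, ids):
        """Build the results page dynamically; each indexing term becomes a
        hyperlink that re-runs the search, broadening or narrowing it."""
        items = []
        for r in records:
            if r["id"] not in ids:
                continue
            links = " ".join('<a href="/search?q=%s">%s</a>'
                             % (urllib.parse.quote(s), html.escape(s))
                             for s in r["subjects"])
            items.append("<li>%s: %s</li>" % (html.escape(r["title"]), links))
        return "<ul>\n" + "\n".join(items) + "\n</ul>"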

Noteworthy for those browsing is the Table of Contents, in which all subjects are listed with the titles of all resources on a subject filed under the given subject term. For those who know which resource they want to use, clicking on the title will open it up for use. For those needing a little more information, clicking on a subject will bring up the title plus annotation for all resources listed under that subject.

In-depth description and indexing, careful selection, a considerable number of options in browsing/searching (access points), and ample help within WEBINDEX combine to make it easy for the academic community to discover and use important Internet tools.

WEBINDEX for Indexers — Content Development and HTML Document Creation Made Easy
Several features and tools for resource description and indexing make WEBINDEX attractive to those contributing to the building of the files. They combine to make WEBINDEX records easy to add, edit, and maintain. Adding URLs (uniform resource locators, or resource addresses), titles, subjects, key words, and annotations is simple and straightforward (see the WEBINDEX for Indexers guide). Often, within the Windows or Linux environments (among others), adding a record is simply a matter of cutting and pasting.

It is crucial to note that in adding or editing, one does NOT need to know HTML (Hypertext Markup Language, the language used to create Web documents). All new records are converted into this format automatically, when they are displayed, by WEBINDEX. This is a useful feature and means that participants with subject knowledge, but no HTML experience, need not be challenged to learn yet another system perceived as complicated.

Moreover, from our central adder document, data entered for each record is automatically converted to HTML in each of six dynamically created WEBINDEX HTML documents at the time these documents are requested for viewing (i.e., the search results set as well as the following indexes: Titles, Subjects, Key Words, Table of Contents, Date-What’s New, and, soon, Author). By way of contrast, in virtual libraries or subject guides made up of static HTML resource lists/documents, data representing each individual record would have to be manually entered in EACH of the separate HTML indexing documents. WEBINDEX’s ability to allow the adding of record data into one document and have this data then be automatically expressed in several HTML documents (again, results set pages and indexes) saves substantial amounts of time that would otherwise go into manual HTML work. For this reason, WEBINDEX has provided a time-efficient means of building a substantial virtual library for both those with and without significant HTML experience.
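
The enter-once, express-everywhere idea can be sketched as follows: records are stored once as fielded data, and each HTML index document is generated from that same store only when it is requested. The view names follow the list above; the record fields and rendering details are our own illustrative assumptions.

    # Sketch of "enter once, express in every index": one record store,
    # with each HTML index document generated on request. Field names and
    # rendering details are illustrative only.
    import html

    def render_index(records, view):
        """Generate one index document (titles, what's-new, subjects) on request."""
        if view == "titles":
            ordered = sorted(records, key=lambda r: r["title"].lower())
        elif view == "whats_new":
            ordered = sorted(records, key=lambda r: r["date_added"], reverse=True)
        else:  # "subjects": file each record under its first subject heading
            ordered = sorted(records, key=lambda r: r["subjects"][0].lower())
        lines = ['<li><a href="%s">%s</a></li>'
                 % (html.escape(r["url"]), html.escape(r["title"]))
                 for r in ordered]
        return "<ul>\n" + "\n".join(lines) + "\n</ul>"

Because every index view is derived from the same stored data, adding one record updates all generated documents at once, which is the source of the time savings described above.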

Another helpful feature that is indispensable in maintaining large collections of links is having a program that will automatically check and flag, at specified intervals, resources which have changed locations. WEBINDEX’s URL Checker does just this. It is important to note in this regard, for example, that five percent of the resources in the life sciences component of WEBINDEX moved over the last ten-month period checked (the majority of these, though, simply moved to a slightly different place on the same server, making them easy to re-find). In addition, a slight variant of the URL Checker has been employed to ensure that a resource to be added is not a duplicate (important given the current size of most of the files).
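
In outline, such a checker can be quite simple, as the sketch below suggests: visit each record's URL, flag links that fail to open for manual review, note redirects as probable moves, and reuse the same URL list to reject duplicates before adding. The field names and error handling are illustrative, not the actual URL Checker program.

    # Sketch of a URL Checker: flag dead links and changed locations, and
    # reuse the URL set as a duplicate check before adding. Field names
    # are hypothetical.
    import urllib.request

    def check_urls(records):
        """Visit each record's link; flag dead links and changed locations."""
        flagged = []
        for rec in records:
            try:
                with urllib.request.urlopen(rec["url"], timeout=15) as resp:
                    final_url = resp.geturl()
                if final_url != rec["url"]:
                    # Often a slightly different place on the same server.
                    flagged.append((rec["id"], "moved", final_url))
            except OSError:
                flagged.append((rec["id"], "dead", None))  # for manual review
        return flagged

    def is_duplicate(candidate_url, records):
        """Variant used before adding a record: is this URL already held?"""
        return any(candidate_url == rec["url"] for rec in records)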

As WEBINDEX has grown, various paths for future development have presented themselves. For instance, we have started exploring the development of networks of cooperating selectors/indexers not only at UCR but at other UC campuses as well. Currently librarians from all nine campuses and Stanford participate. WEBINDEX presents significant opportunities for efficient, multi-campus, shared Internet resource collection building. Such efforts would result in a significant reduction in redundant collecting efforts at each individual campus. An informal pilot project completed during the summer of 1995 successfully outlined many of the issues and concerns that would be a part of such a project, and multi-campus pilot projects in the areas of shared government information and maps resource collecting are continuing. We are also examining the need for more uniform indexing languages for the various WEBINDEXs. To this end, we have begun using Library of Congress Subject Headings (see Library of Congress Subject Headings as Subject Terminology in a Virtual Library and an example of a WEBINDEX LC Disciplines List). In addition, shared indexing terminology among WEBINDEXs already exists, providing conventions and protocols for describing document types and approaches to geographic descriptors. Another ongoing project involves modifying the WEBINDEX management system so that both the data files and management programs can be distributed to participating campuses/organizations.

For WEBINDEX contributors, a Web forms-based record adder/editor is provided which allows easy adding/editing/deleting of multiple HTML documents as appropriate. In adding and editing, WEBINDEX features a built-in automatic HTML conversion feature which eliminates the need for contributors to acquire HTML skills or make additions/changes from more than one HTML document (attractive to those with HTML experience).

Easy, efficient maintenance of a large digital library depends on having an automatic way of checking the working “linkability” of the records included. WEBINDEX provides this in the form of its URL Checker, which checks for and then flags records with links that will not open, for later manual verification by database editors.

It is important to note, in closing, that many of us who directly provide Internet access to faculty and students have found that expectations for robotic and other Internet navigational or finding tools involving very minimal human input have not been met. In comparison, the WEBINDEX virtual library is an efficient and academically focused organizing tool which joins the vast, traditional experience of our profession in organizing information with the Internet, in order to help create intelligent order among, and access to, high quality, well selected and annotated electronic resources.

Bibliography

Sergey Brin and Lawrence Page, 1998. “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Networks, volume 30, numbers 1-7, pp. 107-117, and Proceedings of the Seventh International World Wide Web Conference, April 1998, Brisbane, Australia, at http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

Soumen Chakrabarti, Martin van den Berg, and Byron Dom, 1999a. “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proceedings of the Eighth International World Wide Web Conference, May 1999, Toronto, Canada, at http://www8.org/w8-papers/5a-search-query/crawling/

Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, Jon M. Kleinberg, and David Gibson, 1999b. “Hypersearching the Web,” Scientific American, volume 280 (June), pp. 54-60, and at http://www.sciam.com/1999/0699issue/0699raghavan.html

Huan Chang, David Cohn, and Andrew McCallum, in press. “Creating Customized Authority Lists.” Submitted to Digital Libraries 2000, and at http://citeseer.nj.nec.com/pdf/266113

John Cleary and Leonard Trigg, 1998. “Experiences with OB1, an Optimal Bayes Decision Tree Learner,” at http://www.cs.waikato.ac.nz/ml/publications/1998/Cleary-Trigg-OB1.pdf

Mark Craven, 1999. “Learning to Extract Relations from Medline,” AAAI-99 Workshop on Machine Learning for Information Extraction, at http://www.isi.edu/~muslea/RISE/ML4IE/ml4ie.craven.ps

Susan Dumais, John Platt, David Heckerman, and Mehran Sahami, 1998. “Inductive learning algorithms and representations for text categorization,” In: CIKM-98: Proceedings of the Seventh International Conference on Information and Knowledge Management, and at http://robotics.Stanford.EDU/users/sahami/papers-dir/cikm98.pdf

Dayne Freitag, 1998. “Multistrategy Learning for Information Extraction,” Proceedings of the 15th International Conference on Machine Learning (ICML-98), and at http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/ms-ie.ps.gz

David Gibson, Jon Kleinberg, and Prabhakar Raghavan, 1998. “Inferring Web Communities from Link Topology,” Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, pp. 225-234, and at http://citeseer.nj.nec.com/pdf/55771

Oren Glickman and Rosie Jones, 1999. “Examining Machine Learning for Adaptable End-to-End Information Extraction Systems,” AAAI-99 Workshop on Machine Learning for Information Extraction (July 19), Orlando, Fla., and at http://www.isi.edu/~muslea/RISE/ML4IE/ml4ie.glickman&jones.ps

Thomas Hofmann, 1999. “Probabilistic Latent Semantic Indexing,” Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR’99), Berkeley, Calif., and at http://www.icsi.berkeley.edu/~hofmann/

Thomas Hofmann, 1998. “Learning and Representing Topic: A Hierarchical Mixture Model for Word Occurrences in Document,” Conference for Automated Learning and Discovery, Workshop on Learning from Text and the Web, Carnegie-Mellon University, and at http://www.icsi.berkeley.edu/~hofmann/

Rosie Jones, Andrew McCallum, Kamal Nigam, and Ellen Riloff, 1999. “Bootstrapping for Text Learning Tasks,” In: IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, and at http://www.cs.cmu.edu/~rosie/papers/bootstrap-ijcaiws99.ps

Jon M. Kleinberg, 1998. “Authoritative Sources in a Hyperlinked Environment,” In: Howard Karloff (editor). Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, and at http://citeseer.nj.nec.com/pdf/87928

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, in press. “Automating the Construction of Internet Portals with Machine Learning,” at http://citeseer.nj.nec.com/pdf/196313

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, 1999a. “A Machine Learning Approach to Building Domain-Specific Search Engines,” In: Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), and at http://citeseer.nj.nec.com/pdf/54866

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, 1999b. “Building Domain-Specific Search Engines with Machine Learning Techniques,” AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace, and at http://citeseer.nj.nec.com/pdf/64116

Craig G. Nevill-Manning, Ian H. Witten, and Gordon W. Paynter, 1999. “Lexically-Generated Subject Hierarchies for Browsing Large Collections,” International Journal of Digital Libraries, volume 2, number 3, pp. 111-123, and at http://sequence.rutgers.edu/~nevill/

Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom Mitchell, in press. “Text Classification from Labeled and Unlabeled Documents using EM,” Machine Learning Journal, and at http://citeseer.nj.nec.com/pdf/14059

Andreas Paepcke, Hector Garcia-Molina, Gerard Rodriguez-Mula, and Junghoo Cho, 1998. “Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies,” Stanford Digital Library Technologies Working Papers, SIDL-WP-1998-0099, at http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1998-0099

Jason Rennie and Andrew McCallum, 1999. “Using Reinforcement Learning to Spider the Web Efficiently,” In: Proceedings of the 16th International Conference on Machine Learning (ICML-99), and at http://citeseer.nj.nec.com/pdf/7537

Ellen Riloff and J. Lorenzen, 1999. “Extraction-Based Text Categorization: Generating Domain-specific Role Relationships Automatically,” In: Tomek Strzalkowski (editor). Natural Language Information Retrieval. Boston: Kluwer, and at http://www.cs.utah.edu/~riloff/psfiles/nlp-ir-chapter.ps
