Write an essay in which you discuss and critically evaluate the role, current use and potential future use of metadata, including particular metadata schemas, in supporting more efficient and effective information retrieval within the context of either web sites or organisational Intranets.
Cawkell (1999) describes the web as 'an informal mixture of records linked to pages consisting of articles, advertisements, people's CVs, situations vacant, pornographic pictures, movie reviews, motion video, minutes of meetings, recipes and other items covering all aspects of human endeavour'. Clearly there is a need to bring order to the chaos of the web. In this essay I will introduce metadata and the role it plays in bringing order to the aforementioned chaos. I will discuss some of the shortcomings of metadata and how organisations attempt to overcome them with metadata schemes and standards. I will evaluate the pros and cons of current metadata usage, then look at how things might change in the future.
Metadata can be defined as data about data. It 'describes how and when and by whom a particular set of data was collected, and how the data is formatted' (Webopedia, 2004). Metadata is structured data that increases the chances of information retrieval by describing the information resource. Unlike metadata for information held on paper or in databases, web-based metadata is contained within the document, and is a key component of the information resource. 'It is the Internet-age term for information that librarians traditionally have put into catalogues and it most commonly refers to descriptive information about Web resources' (Dublin Core, 2003).
The key roles of metadata are to: enable discovery of information based on relevance; organise information effectively; facilitate interoperability between diverse systems, software, data structures, and interfaces; identify the target information resource; and provide archiving and preservation facilities so that the lineage of the information can be traced (Understanding Metadata, 2004, p.1). The addition of ownership and currency details can be used as an indicator of the quality of the information resource (Armstrong, 1999).
El-Sherbini (2001, p.24) extends this list to include: organisation and maintenance of an organisation's investment in data; provision of information to data catalogues; provision of information to aid data transfer; discovery and retrieval of information; prevention of unauthorised access to restricted information; provision of common agreement on what elements to use and what their content should be; provision of information that might affect the data, such as legal conditions, size, or age; the lineage of the data, including its history and changes; the owner; and relationships to other versions of the resource.
Resource discovery is enhanced by metadata because resources can be found through metadata elements that can be searched directly, for example Title, Creator, Description and Subject. Metadata identifies resources, and can therefore bring similar resources together while at the same time distinguishing dissimilar resources. Identification of the resource can be by file name, URL (Uniform Resource Locator), or a persistent identifier such as a PURL (Persistent URL) or DOI (Digital Object Identifier). Persistent identifiers are the preferred choice because object locations can change, making the standard URL invalid.
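To illustrate how searching a specific metadata element differs from searching text in general, the following is a minimal sketch in Python. The records, identifiers and the search function are hypothetical and invented for illustration; only the element names (Title, Creator, Description, Subject) come from the discussion above.

```python
from typing import Dict, List

# Hypothetical records, invented for illustration. Each record describes one
# resource using a handful of metadata elements.
RECORDS: List[Dict[str, str]] = [
    {
        "identifier": "doc-1",
        "title": "Metadata for the Web",
        "creator": "A. Author",
        "subject": "metadata",
        "description": "An introduction to describing web resources.",
    },
    {
        "identifier": "doc-2",
        "title": "Cooking with Herbs",
        "creator": "B. Writer",
        "subject": "recipes",
        "description": "Mentions metadata only in passing.",
    },
    {
        "identifier": "doc-3",
        "title": "Cataloguing Digital Collections",
        "creator": "A. Author",
        "subject": "metadata; cataloguing",
        "description": "Applying library practice to web resources.",
    },
]


def search(records: List[Dict[str, str]], element: str, term: str) -> List[str]:
    """Return identifiers of records whose named element contains the term.

    Matching is a simple case-insensitive substring test.
    """
    needle = term.lower()
    return [r["identifier"] for r in records if needle in r.get(element, "").lower()]
```

Searching the Subject element for 'metadata' brings doc-1 and doc-3 together and leaves out doc-2, even though the word appears in the description of doc-2. This is the sense in which metadata groups similar resources and distinguishes dissimilar ones.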
Using metadata to describe an information resource makes it understandable to humans and electronic systems, promoting interoperability. A common approach to this is known as 'metadata harvesting', often associated with the Open Archives Initiative. Data providers translate their native metadata to a common core set of elements and make this available for harvesting. A service provider then gathers the metadata into a centralised index allowing cross-repository searching regardless of the metadata formats used by the original participants (Open Archives Initiative, 2004).
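The harvesting idea can be sketched in a few lines of Python. This is not the Open Archives Initiative protocol itself, only an illustration of the principle: each data provider maps its native fields to a shared core, and a service provider merges the results into one index. The providers, field names and mapping tables below are all hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical native records from two data providers that use different
# field names for the same concepts.
LIBRARY_RECORDS: List[Record] = [
    {"id": "lib-1", "main_title": "Web Indexing", "author": "C. Smith"},
]
MUSEUM_RECORDS: List[Record] = [
    {"id": "mus-7", "object_name": "Early Web Poster", "maker": "D. Jones"},
]

# Hypothetical crosswalks: native field name -> common core element name.
LIBRARY_CROSSWALK = {"id": "identifier", "main_title": "title", "author": "creator"}
MUSEUM_CROSSWALK = {"id": "identifier", "object_name": "title", "maker": "creator"}

PROVIDERS: Dict[str, Tuple[List[Record], Dict[str, str]]] = {
    "library": (LIBRARY_RECORDS, LIBRARY_CROSSWALK),
    "museum": (MUSEUM_RECORDS, MUSEUM_CROSSWALK),
}


def to_common(record: Record, crosswalk: Dict[str, str]) -> Record:
    """Translate one native record to the common element set."""
    return {crosswalk[k]: v for k, v in record.items() if k in crosswalk}


def harvest(providers: Dict[str, Tuple[List[Record], Dict[str, str]]]) -> List[Record]:
    """Gather translated records from every provider into one central index."""
    index: List[Record] = []
    for name, (records, crosswalk) in providers.items():
        for record in records:
            common = to_common(record, crosswalk)
            common["provider"] = name  # remember where the record came from
            index.append(common)
    return index


def search_index(index: List[Record], element: str, term: str) -> List[str]:
    """Case-insensitive substring search over one common element."""
    needle = term.lower()
    return [r["identifier"] for r in index if needle in r.get(element, "").lower()]


INDEX = harvest(PROVIDERS)
```

A single title search over INDEX now reaches both repositories, although neither originally used a field called 'title'.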
Metadata can help to ensure that digital information resources will survive and continue to be accessible in the future. Digital information can be corrupted or lost, or simply outdated and unusable as advances in technology overtake the format of the data type. Archiving and preservation need special metadata elements to track the lineage of the information resource, to detail its physical characteristics, and to document its behaviour in order to emulate it on future technologies.
The merit of metadata is dependent upon the quality of the metadata scheme. Metadata schemes are sets of elements designed for a particular function. Milstead & Feldman (1999) argue that:
The value of metadata elements is limited if there is no common agreement on what elements to use or what their content should be. They cannot be searched with any confidence; they might even be unintelligible when found. Metadata cannot fully serve its purpose unless it is subjected to a certain amount of standardisation.
They also argue, however, that as groups recognise the need for standards, there has been a rise in conflicting standards and projects for standardising electronic resources. In fact, they state that 'there are many - perhaps too many - formal initiatives underway'.
They classify metadata schemes into three formats: simple formats with unstructured data, including proprietary formats of search engines and directories like AltaVista and Yahoo; structured formats based on emerging standards such as the Dublin Core; and rich formats based on international standards such as Machine Readable Cataloguing (MARC).
The Dublin Core was developed in 1995 at a workshop organised by the Online Computer Library Center (OCLC) and the National Center for Supercomputing Applications (NCSA). The workshop's goal was the improvement of indexing and bibliographic control of web pages through the definition of data elements. The data element set was made as simple as possible so that it could be easily incorporated into publishing software tools (Rowley & Farrow, 2000, p.49).
The Dublin Core has 15 elements, which have become the standard basic set for most metadata schemas. However, it focuses on resource discovery, not the other roles of metadata like organising information, interoperability between diverse systems, identification, archiving, and access control. This has paved the way for hybrid developments based on the Dublin Core to take place.
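As an illustration of how simple the element set is to work with, the sketch below lists the 15 elements and extracts them from a web page in which they are embedded as HTML meta tags with a 'DC.' prefix, a common embedding convention. The sample page and the function and class names are hypothetical; the parser is Python's standard html.parser module.

```python
from html.parser import HTMLParser
from typing import Dict, List

# The 15 elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical page, invented for illustration.
SAMPLE_PAGE = """
<html><head>
<title>Metadata for the Web</title>
<meta name="DC.title" content="Metadata for the Web">
<meta name="DC.creator" content="A. Author">
<meta name="DC.subject" content="metadata">
<meta name="DC.subject" content="information retrieval">
<meta name="keywords" content="ignored, not Dublin Core">
</head><body></body></html>
"""


class DublinCoreExtractor(HTMLParser):
    """Collect the content of meta tags whose name is 'DC.<element>'."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: Dict[str, List[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()
            if element in DC_ELEMENTS:
                # Elements are repeatable, so values are kept in a list.
                self.elements.setdefault(element, []).append(content)


def extract_dublin_core(html: str) -> Dict[str, List[str]]:
    """Return the Dublin Core elements found in an HTML document."""
    parser = DublinCoreExtractor()
    parser.feed(html)
    return parser.elements
```

Calling extract_dublin_core(SAMPLE_PAGE) returns the title, the creator and both subject values, while the ordinary keywords tag is ignored because it is not one of the 15 elements.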
In some developments, additional metadata elements have been included to meet user requirements; in others, only selected elements from a scheme are used. An example is the U.S. Department of Education's Gateway to Educational Materials (GEM). GEM bases its scheme on the Dublin Core, limits the Dublin Core elements that can be used, and makes some elements mandatory. GEM also defines additional elements such as Audience, Grade, Quality, and Standards (Understanding Metadata, 2004, p.9).
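A scheme of this kind, which restricts a base element set, makes some elements mandatory and adds its own, can be sketched as a simple validation check. The sketch below is in the spirit of GEM but is not the GEM specification: the choice of which Dublin Core elements are allowed and which are mandatory is an assumption made purely for illustration. Only the four additional element names are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class ApplicationProfile:
    """A scheme built on a base element set: a restricted list of allowed
    elements plus a subset that every record must supply."""

    allowed: FrozenSet[str]
    mandatory: FrozenSet[str]

    def validate(self, record: Dict[str, str]) -> List[str]:
        """Return a list of problems; an empty list means the record conforms."""
        present = set(record)
        problems = [
            f"missing mandatory element: {name}"
            for name in sorted(self.mandatory - present)
        ]
        problems += [
            f"element not in profile: {name}"
            for name in sorted(present - self.allowed)
        ]
        return problems


# Assumption: the selected Dublin Core elements and the mandatory set below
# are hypothetical and do not reproduce the real GEM rules.
SELECTED_DC = {"title", "creator", "subject", "description", "date", "rights"}
EXTENSIONS = {"audience", "grade", "quality", "standards"}  # named in the text

EDUCATION_PROFILE = ApplicationProfile(
    allowed=frozenset(SELECTED_DC | EXTENSIONS),
    mandatory=frozenset({"title", "description"}),
)

GOOD_RECORD = {
    "title": "Fractions lesson plan",
    "description": "A one-hour lesson introducing fractions.",
    "grade": "4",
}
BAD_RECORD = {"title": "Fractions lesson plan", "coverage": "UK"}
```

EDUCATION_PROFILE.validate(GOOD_RECORD) returns an empty list, whereas BAD_RECORD is reported as missing a description and as using an element (coverage) that the profile does not allow.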
One of the largest developments was the Nordic Metadata Project, which developed tools for the creation, harvesting, and indexing of metadata. Another, the MetaWeb Project, had the goal of developing indexing services, user tools, and metadata element sets to promote the use of metadata. A further example of extending the Dublin Core is the Computer Interchange of Museum Information (CIMI), which included a more detailed level of description to satisfy rights and use management requirements (Milstead & Feldman, 1999).
Another scheme, the Platform for Internet Content Selection (PICS), supports the rating of web information resources. PICS is a working group combining a number of industries with the objective of facilitating 'the development of technologies to give users of interactive media, such as the Internet, control over the kinds of material to which they and their children have access' (W3C, 2003).
W3C describes the aims of PICS as:
* Self-rating: enable content providers to voluntarily label the content they create and distribute.
* Third-party rating: enable multiple, independent-labelling services to associate additional labels with content created and distributed by others. Services may devise their own labelling systems, and the same content may receive different labels from different services.
* Ease-of-use: enable parents and teachers to use ratings and labels from a diversity of sources to control the information that children under their supervision receive.
There are a number of schemes which focus on the archiving and preservation of the information resource including CES (Corpus Encoding Standard), EAD DTD (Encoded Archival Description Document Type Definition), MEP (Model Editions Partnership) and TEI (Text Encoding Initiative).
Howarth (2000) cites Baker's statement that 'researchers today agree that no single type of metadata can suit every application, every type of resource, and every community of users. Rather, the broad diversity of potential metadata needs can best be met by a multiplicity of separate, but functionally focused, metadata packages or schemas'.
The problems associated with the use of multiple schemes are the target of the W3C and its Resource Description Framework (RDF), which enables different metadata schemes, created by different communities, to be expressed within a single metadata architecture that supports 'interoperability between applications that exchange machine-understandable information on the Web' (Boye, 1998).
The benefits of using metadata for better information retrieval are well documented. However, there are still some important issues requiring consideration before metadata can be considered the answer to the 'information overload' problem.
The American Society of Indexers (2003) states that 'most commercial search engines now assign very little weight to text found in meta tags'. Milstead & Feldman (1999) back this up with 'moreover, the newer search engines, particularly Web search engines, have been designed to search on ill-assorted collections of unstructured text. There is no hope of cataloguing the enormous array of web pages in a systematic fashion'.
Henshaw & Valauskas (2001) highlight the inadequacies of some search engines in their investigation into the results of a study of seven search engines: AltaVista, Excite, Google, HotBot, Infoseek, Lycos and Northern Light. A search was carried out for selected papers, at that point without metadata, from the 'Internet-only' periodical 'First Monday'. The search was then repeated five months later, after the addition of metatags. Results of the study indicated that the addition of the metadata had not significantly improved the discoverability of the papers. The authors concluded that 'for search engines to succeed, metadata has to matter; but for metadata to make a difference, it has to be widely accepted, standardized and applied in a routine fashion'.
They go on to recommend that 'although recent projects have shown the value of metadata as a catalyst for information retrieval, unfortunately these efforts in isolation will have little effect on the functionality of search engines without the development of effective and utilitarian metadata standards that are then widely used and accepted'.
The inadequacies of web search engines are compounded by the lack of metadata schemes implemented by content providers. El-Sherbini & Klim (2004) refer to recent research documented by O'Neill et al., which found that although some metadata is used on the majority of web-based resources, standard metadata schemes are applied in only a few of them: 'a discouraging aspect of metadata usage trends on the public Web over the last five years is the seeming reluctance of content creators to adopt formal metadata schemes with which to describe their documents'.
Another issue for content providers is who actually applies the metadata. Web-based metadata is contained within the document, and is therefore usually managed by the content provider or the webmaster. Milstead & Feldman (1999) ask the question 'how do we get millions of non-information professionals to understand the importance of cataloguing to a certain level and standard when even professionals don't always agree?'
If metadata schemes are to be the long-term means of dealing with these inadequacies, perhaps regulation should be considered as an additional measure. Educating search engine and content providers only increases use of the information retrieval mechanism; it does not necessarily make it a better one. Measures should be considered that prevent unscrupulous webmasters from cheating by using unrelated metadata descriptions to improve placement on search engines. Apart from misleading searchers for personal gain, this practice provokes search engines into implementing filtering programs, with some even disregarding metadata altogether.
Although the future of metadata is unclear, some organisations are trying to move standards forward. To date, metadata standards have focussed on the descriptive and administrative elements required for the discovery, identification, retrieval, rights and preservation of an information resource. Newer standards are being introduced to address the technical metadata required to facilitate interoperability between different systems (Understanding Metadata, 2004, p.12).
Many organisations that have developed their own hybrid metadata schemes are trying to attain international standard recognition, as Dublin Core did by becoming an official ANSI/NISO standard in 2001 and then an international standard in 2003 (Understanding Metadata, 2004, p.12).
The World Wide Web Consortium (W3C) has incorporated its metadata into the 'Semantic Web', an initiative providing 'a common framework that allows data to be shared and reused across application, enterprise, and community boundaries'. RDF is one of the key enablers: it is directed at standards that increase the interoperability of metadata, rather than at specific metadata schemas (Understanding Metadata, 2004, p.12).
However, Milstead & Feldman (1999) believe that 'there are too many players with too many different agendas, resulting in tremendous volatility'. They predict that:
* For some time to come, the number of players on the field will continue to increase. More communities and sub-communities will want to make sure that their resources are covered by metadata schemes.
* As metadata schemes proliferate, so will registries of the schemes, until there are registries of registries.
* At the same time, there will be some settling toward a smaller number of 'standards' in use by major groups, with a massive scattering of outliers and non-standard or even ad hoc element sets.
* As fast as an element set is developed and standardised, one or more sets of guidelines and/or interpretations will also be forthcoming. The guidelines will be aimed at assuring that creators of metadata are consistent. The interpretations will be provided by major creators of metadata and will describe how they choose to implement the elements. Bit players will have to follow along or be out of synch.
* The situation will be similar with enumerated lists for such elements as resource type. There will be a few major 'standard' lists, with various communities developing their own extensions.
* National borders will become even less relevant than they are today. The Internet itself is inherently ignorant of national borders, and users of metadata will continue to want information across borders.
* Cross-language metadata standards will be developed.
To conclude, in this essay I have introduced metadata and its roles and uses both today and in the future. I have discussed some of the perceived shortcomings of metadata and how schemes such as Dublin Core and its extended derivatives address them and attempt to impose some structure on the web. The success of metadata seems to be dependent not just upon uptake by both content providers and search service providers, but also upon consolidated input from the owners of the schemes themselves. Perhaps in the future we will see greater control and regulation of metadata by a single governing body.