RMIT | Faculty of Art, Design and Communication | Department of Visual Communication | Sunrise Research Laboratory

MetaData - Quick Tutorial

This document gives a brief introduction to metadata and some of its applications. It is at an introductory level, and assumes a basic knowledge of HTML. It demonstrates how to manually include metadata in HTML.

Definition

MetaData is a way of formally including summary information about various aspects of electronic documents. In general, it encompasses any data that describes such aspects of a document as content, quality, field of knowledge, censorship details, copyright information, format, etc.

Examples of metadata in the non-internet world are:

Application

When a search engine is used to look for something on the internet, it does not look at actual documents - it looks in one or more big data bases that have been compiled automatically by software robots that 'crawl' the web. These robots use various techniques to access documents at a site, but basically they all look for keywords in a document's contents to decide how the reference to the document should be categorised.

There are some big problems with these techniques, and the following are both addressed by the addition of metadata: Bandwidth and Effectiveness.

Bandwidth. To generate its data base entries, a robot has to read each document in its entirety, which is one of the biggest bandwidth killers on the internet today. By placing metadata in a known place (near the top of the document), robots need only read in a few lines of each document to generate their data base entries.
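The saving can be illustrated with a minimal Python sketch, assuming the metadata sits inside the HEAD element and that a robot is free to stop reading at the closing HEAD tag. The function name `read_head_only` and the sample document are hypothetical and only serve the illustration.

```python
def read_head_only(lines):
    """Return the lines up to and including the one that closes the HEAD."""
    head = []
    for line in lines:
        head.append(line)
        if "</head>" in line.lower():
            break  # nothing after this point needs to be fetched
    return head


# Hypothetical document: a short head followed by a long body.
document = [
    "<HTML>\n",
    "<HEAD>\n",
    '<META NAME="DC.SUBJECT" CONTENT="Metadata">\n',
    "</HEAD>\n",
    "<BODY>\n",
] + ["<P>Body text that a keyword crawler would have to read.</P>\n"] * 1000

head = read_head_only(iter(document))
chars_read = sum(len(line) for line in head)
chars_total = sum(len(line) for line in document)
print(f"read {len(head)} of {len(document)} lines "
      f"({chars_read} of {chars_total} characters)")
```

A real robot would stop the network transfer at the same point rather than iterate over a list already held in memory, but the principle is the same.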

Effectiveness. Current methods do not provide useful data bases. There are a couple of reasons for this - the first is language. An unsophisticated user will often not find the information they are looking for because they don't use the same language that the crawlers and data bases use. (These users would also have the same problems with library catalogues.) At the other end of the scale are search enquiries that yield hundreds of thousands of 'hits'. This is again largely a problem of user sophistication - a novice user interested in 'computers' will have an impossible amount of information to wade through to find something of use to them.

The addition of metadata will overcome a lot of these types of problems by providing agreed vocabularies and thesauri. The other problem that metadata will help to overcome is the quality of information required. Users will be able to specify a value from {pre-school, primary-school, high-school, university, ...} to determine what level of information they require. Further to this, users may eventually be able to determine the type of resource at a given level - so {reference, general, historic, current-news, tutorial, review, ...} may also be specified. The addition of metadata will allow much more interactive searches that will assist users without the required prior knowledge to perform effective searches.

The interactive process will set the user on the right knowledge tree and may look something like:

  1. Field of knowledge is first established with a general thesaurus. A user may enter 'matrix' and the search engine will come back with the response: 'This database has entries for "matrix" in the areas of {science,general}'
  2. User selects 'science'. Engine then switches to the science thesaurus and responds with 'The term "matrix,science" has the following senses: mathematics: data object, metallurgy: forge and medicine: womb'
  3. The user selects mathematics which is responded to with 'Select level required from {high-school, university}'
  4. After high-school is selected, the reply may be 'This data base has entries for {definition,tutorial,application} select term or send the request "ISO-S001-2-09876" to the following search engines....'
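The four steps above can be sketched in Python as a walk down a small lookup structure. This is only an illustration: the names `THESAURUS`, `LEVELS`, `INFO_TYPES` and `next_prompt` are hypothetical, and the data is limited to the values quoted in the steps.

```python
# Hypothetical thesaurus data, copied from the four steps above.
THESAURUS = {
    "matrix": {
        "science": {
            "mathematics": "data object",
            "metallurgy": "forge",
            "medicine": "womb",
        },
        "general": {},
    },
}
LEVELS = ["high-school", "university"]
INFO_TYPES = ["definition", "tutorial", "application"]


def next_prompt(keyword, choices):
    """Return the options the engine would offer after the choices made so far."""
    areas = THESAURUS[keyword]
    if len(choices) == 0:
        return sorted(areas)              # step 1: field of knowledge
    if len(choices) == 1:
        return sorted(areas[choices[0]])  # step 2: senses within that field
    if len(choices) == 2:
        return LEVELS                     # step 3: level required
    if len(choices) == 3:
        return INFO_TYPES                 # step 4: type of information
    return []                             # nothing left to narrow


print(next_prompt("matrix", []))
print(next_prompt("matrix", ["science"]))
print(next_prompt("matrix", ["science", "mathematics"]))
```

Each call returns the set of terms the user picks from next, so the search engine never needs the user to know the vocabulary in advance.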

Of course someone who knows their way around a bit more can simply provide 'field=mathematics,keyword=matrix,level=secondary,infotype=tutorial' as their search term.
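A search term of this shape is simple for software to handle. The following Python sketch assumes the term is a comma-separated list of key=value pairs; the function names and the two catalogue entries are hypothetical and do not describe any existing search engine.

```python
def parse_search_term(term):
    """Split 'key=value,key=value' into a dict; reject malformed parts."""
    query = {}
    for part in term.split(","):
        key, separator, value = part.partition("=")
        if not separator or not key.strip() or not value.strip():
            raise ValueError(f"expected key=value, got {part!r}")
        query[key.strip().lower()] = value.strip()
    return query


def matches(entry, query):
    """True when a catalogue entry agrees with every field in the query."""
    return all(entry.get(key) == value for key, value in query.items())


# Hypothetical catalogue entries built from document metadata.
CATALOGUE = [
    {"field": "mathematics", "keyword": "matrix", "level": "secondary",
     "infotype": "tutorial", "url": "doc-a"},
    {"field": "metallurgy", "keyword": "matrix", "level": "university",
     "infotype": "reference", "url": "doc-b"},
]

query = parse_search_term(
    "field=mathematics,keyword=matrix,level=secondary,infotype=tutorial")
hits = [entry["url"] for entry in CATALOGUE if matches(entry, query)]
print(hits)
```

Because every field is matched exactly, this only works once documents and users share an agreed vocabulary, which is the point of the thesauri described above.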

Generation

Just as a movie can have several reviews, a document can have several different metadata entries that describe it. They can live inside the document, or they can exist externally in someone's database.

The entries may be provided by the author, generated automatically by software (though this defeats the purpose somewhat), or supplied by a specialist third party. The third party may be a software form you fill out with a particular database, or it may be a group of specialists in your field to whom you send your document for classification (along the lines of journal submissions).

The latter option is perhaps the best, in the sense that these specialist groups already have their own rich classification systems and respect from their peers - for example, the American Mathematical Society has a primary and secondary AMS classification that is known and used by many mathematicians and the journals they read and write for. While an AMS classification number as metadata wouldn't induce the same level of respect as a refereed journal, it provides a useful amount of confidence in a document's suitability.

The problem with this approach is that it takes someone with a good level of experience to successfully classify specialist material. These people usually expect healthy remuneration for their expertise and time. A good trade-off is for authors to provide their own metadata according to a specific scheme. As an example of this, the 'Dublin-Core' metadata descriptors are now briefly explained.

Dublin-Core

The DublinCore metadata for this document may look something like:

<HTML>
<HEAD>
<TITLE>Metadata Quick Tutorial</TITLE>
<META NAME="DC.CREATOR" 
      CONTENT="ANDY WHITE">
<META NAME="DC.SUBJECT" 
      CONTENT="Metadata">
<META NAME="DC.FORM" 
      CONTENT="text/html">
<META NAME="DC.TITLE" 
      CONTENT="Metadata Quick Tutorial">
<META NAME="DC.DESCRIPTION" 
      CONTENT="A brief introduction to metadata. 
      Description of DublinCore elements.
      Example Usage."> 
<META NAME="DC.SUBJECT"
      CONTENT="metadata,dublincore,dc.,dublin core,
      html, crawlers">
<META NAME="DC.DATE"
      CONTENT="19970415">
<META NAME="DC.TYPE"
      CONTENT="TechReport,UnrefereedArticle,Misc">
<META NAME="DC.IDENTIFIER"
      CONTENT="http://www.srl.rmit.edu.au/???">
<META NAME="DC.LANGUAGE"
      CONTENT="ENG">
<META NAME="DC.SOURCE"
      CONTENT="http://purl.org/metadata/dublin_core">
<META NAME="DC.SOURCE"
      CONTENT="http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html">
<META NAME="DC.SOURCE"
      CONTENT="http://www.sil.org/sgml/nisoLang3-1994.html">
<META NAME="DC.PUBLISHER"
      CONTENT="Sunrise Research Laboratory: http://www.srl.rmit.edu.au/">
<META NAME="DC.CONTRIBUTORS"
      CONTENT="liddy@rmit.edu.au, jonathan@rmit.edu.au">
<META NAME="DC.RELATION"
      CONTENT="????">
<META NAME="DC.RIGHTS"
      CONTENT="http://www.srl.rmit.edu.au/copyright.html">
</HEAD>
<BODY>

where each meta tag may be included zero or more times.
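Because any element may be repeated or left out, software reading these tags is safest storing a list of values per element. The Python sketch below uses the standard library's html.parser to show one way of doing this; the class name `DublinCoreCollector` is hypothetical, and the sample head is an abridged copy of the example above.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DublinCoreCollector(HTMLParser):
    """Gathers DC.* META tags into {element: [value, value, ...]}."""

    def __init__(self):
        super().__init__()
        self.elements = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").upper()
        content = attributes.get("content")
        if name.startswith("DC.") and content is not None:
            # Collapse the line breaks used to wrap long CONTENT values.
            self.elements[name[3:]].append(" ".join(content.split()))


HEAD = """<HEAD>
<META NAME="DC.TITLE"
      CONTENT="Metadata Quick Tutorial">
<META NAME="DC.SOURCE"
      CONTENT="http://purl.org/metadata/dublin_core">
<META NAME="DC.SOURCE"
      CONTENT="http://www.sil.org/sgml/nisoLang3-1994.html">
</HEAD>"""

collector = DublinCoreCollector()
collector.feed(HEAD)
collector.close()

print(collector.elements["TITLE"])         # one value
print(collector.elements["SOURCE"])        # two values
print(collector.elements.get("COVERAGE"))  # element not present
```

Using `get` for an absent element avoids creating an empty entry, so "zero times" stays distinguishable from "present but empty".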

Dublin-Core currently has 15 recommended tags and uses the DC. prefix to distinguish it from other schema. (Definitions as per w3, with my additions in italics.)

TITLE

The name given to the resource by the CREATOR or PUBLISHER.

CREATOR

The person(s) or organisation(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

SUBJECT

The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

The highest level available should be used - as computers will perform the task of adding the more general classification schemes. This element would likely be provided to the author after filling out a classification form - or authors may look up agreed thesauri when they become available.

<META NAME="DC.SUBJECT"
      CONTENT="MATHEMATICS,FLUIDS,PERTURBATION">
<META NAME="DC.SUBJECT"       
      CONTENT="Non-Newtonian Fluid Flow, Bingham
      Plastics, Rheology, Yield-Stress Fluids">

(The following are examples of specialist schema that may be used and are not part of the Dublin-Core.)

<META NAME="AMS.PRIMARY.SUBJECT"       
      CONTENT="65H05">
<META NAME="AMS.SECONDARY.SUBJECT"
      CONTENT="76A05">
<META NAME="DEWEY.SUBJECT"
      CONTENT="532">
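One way software might keep the different subject schemes apart is to treat everything before the final .SUBJECT in the tag name as the scheme. The Python sketch below assumes that naming convention and assumes commas separate terms within one CONTENT value; neither is laid down by Dublin-Core, and the function name `split_subjects` is hypothetical.

```python
def split_subjects(meta_pairs):
    """Group subject terms by scheme, e.g. 'DC', 'AMS.PRIMARY', 'DEWEY'."""
    subjects = {}
    for name, content in meta_pairs:
        parts = name.upper().split(".")
        if len(parts) < 2 or parts[-1] != "SUBJECT":
            continue  # not a subject tag
        scheme = ".".join(parts[:-1])
        # Split on commas and tidy whitespace left over from line wrapping.
        terms = [" ".join(term.split()) for term in content.split(",") if term.strip()]
        subjects.setdefault(scheme, []).extend(terms)
    return subjects


# (name, content) pairs taken from the examples above.
pairs = [
    ("DC.SUBJECT", "MATHEMATICS,FLUIDS,PERTURBATION"),
    ("DC.SUBJECT", "Non-Newtonian Fluid Flow, Bingham\n      Plastics, Rheology, Yield-Stress Fluids"),
    ("AMS.PRIMARY.SUBJECT", "65H05"),
    ("AMS.SECONDARY.SUBJECT", "76A05"),
    ("DEWEY.SUBJECT", "532"),
]

for scheme, terms in split_subjects(pairs).items():
    print(scheme, terms)
```

Keeping the scheme alongside each term lets a search engine consult the matching thesaurus rather than guess which vocabulary a keyword came from.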

DESCRIPTION

A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

Keep in mind that future browsers and style-sheets will probably be able to pull this field out by itself as an abstract that people read before deciding to download the full document.

PUBLISHER

The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

This is an ambiguous definition - an article written by an individual that he/she places on the internet via a university server would specify the university as the publisher, but what if the individual placed the article on a commercial server? Although the document resides on www.someserver.com, the individual presumably pays for it to be there and may rightly feel that they do not need to name their service provider as the publisher.

CONTRIBUTORS

Person(s) or organisation(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and conveners).

DATE

The date the resource was made available in its present form. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schema are possible, but if used, they should be identified in an unambiguous manner.
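Since the digits are easy to write in the wrong order, a value in this form is worth checking before it is published. The Python sketch below is one possible check, assuming only the 8 digit YYYYMMDD form described above; the function name `parse_dc_date` is hypothetical.

```python
from datetime import date, datetime


def parse_dc_date(value):
    """Parse an 8 digit YYYYMMDD string; raise ValueError if it is not a real date."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected 8 digits, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()


print(parse_dc_date("19961203"))  # December 3, 1996

# Day and month swapped (YYYYDDMM) is rejected, because 31 is not a month.
try:
    parse_dc_date("19963112")
except ValueError as error:
    print("rejected:", error)
```

The check cannot catch every swap (a value such as 19960312 is a valid date either way round), so it complements careful entry rather than replacing it.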

TYPE

The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that TYPE will be chosen from an enumerated list of types. A preliminary set of such types can be found at http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html (Reproduced for convenience)

FORM

The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with TYPE, FORM will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.

IDENTIFIER

String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.

SOURCE

The work, either print or electronic, from which this resource is derived, if applicable. For example, an html encoding of a Shakespearian sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.

LANGUAGE

Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages - see http://www.sil.org/sgml/nisoLang3-1994.html. (Reproduced for convenience)

RELATION

Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

COVERAGE

The spatial locations and temporal durations characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

RIGHTS

The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.


Further Resources

Other classification schemes are provided for reference. They may be useful in deciding what area of knowledge embodies the work you are trying to classify. If it is your job to maintain a large site, you may want to consider collecting resources such as these to develop tools that assist users in classifying their work.




Advertisement
A commercial advertisement for a product or service.
Article
A peer-reviewed, refereed article from a journal.
Bibliography
A bibliography of other resources.
Book
A complete book, not formed from separate papers.
Booklet
A work that is printed and bound but without a named publisher or sponsoring institution.
Collection
A book produced from a collection of separate papers.
CourseMaterial
Syllabus, timetable, etc for a course.
Dataset
A set of data of some sort.
HonoursThesis
A university Honours thesis.
Image
A picture of some sort.
InBook
A part of a book, which may be a chapter and/or range of pages.
InCollection
A single paper or article from a published collection.
InProceedings
A single paper from a published workshop or conference proceedings.
Journal
An entire issue of a refereed learned journal.
Magazine
An entire issue of an unrefereed journal or magazine.
Manual
An operations manual for a product.
MastersThesis
A university Masters thesis.
MessageOnModeratedMailingList
The resource is a message on a mailing list which is moderated.
MessageOnUnmoderatedMailingList
The resource is a message on a mailing list which is not moderated.
Misc
Work of another or undetermined type. This is the default scheme value if the scheme is not explicitly stated.
Music
A piece of music or a score.
OrganisationInfo
Some sort of information about an organisation or group (eg: A library homepage on the web).
PhDThesis
A university Doctoral thesis.
PersonalInfo
Some sort of information about an individual (eg: A person's homepage).
Poem
A piece of poetry.
PostingToModeratedNewsgroup
The resource is a message posted to a USENET newsgroup which is moderated.
PostingToUnmoderatedNewsgroup
The resource is a message posted to a USENET newsgroup which is not moderated.
Preprint
Pre-publication of a research article.
Proceedings
A whole published workshop or conference proceedings.
ResearchPaper
A piece of research work.
Service
An online service of some description.
TechReport
An internal university or research organisation technical report.
Unpublished
A document with an author and title, but not formally published.
UnrefereedArticle
An unrefereed article from a journal, magazine or newspaper.
Video
A video of some sort.




Code Language
ACE Achinese
ACH Acoli
ADA Adangme
AFA Afro-Asiatic (Other)
AFH Afrihili (Artificial language)
AFR Afrikaans
AJM Aljamia
AKK Akkadian
ALB Albanian
ALE Aleut
ALG Algonquian languages
AMH Amharic
ANG English, Old (ca. 450-1100)
APA Apache languages
ARA Arabic
ARC Aramaic
ARM Armenian
ARN Araucanian
ARP Arapaho
ART Artificial (Other)
ARW Arawak
ASM Assamese
ATH Athapascan languages
AVA Avaric
AVE Avestan
AWA Awadhi
AYM Aymara
AZE Azerbaijani
BAD Banda
BAI Bamileke languages
BAK Bashkir
BAL Baluchi
BAM Bambara
BAN Balinese
BAQ Basque
BAS Basa
BAT Baltic (Other)
BEJ Beja
BEL Byelorussian
BEM Bemba
BEN Bengali
BER Berber languages
BHO Bhojpuri
BIK Bikol
BIN Bini
BLA Siksika
BRA Braj
BRE Breton
BUG Buginese
BUL Bulgarian
BUR Burmese
CAD Caddo
CAI Central American Indian (Other)
CAM Khmer
CAR Carib
CAT Catalan
CAU Caucasian (Other)
CEB Cebuano
CEL Celtic languages
CHA Chamorro
CHB Chibcha
CHE Chechen
CHG Chagatai
CHI Chinese
CHN Chinook jargon
CHO Choctaw
CHR Cherokee
CHU Church Slavic
CHV Chuvash
CHY Cheyenne
COP Coptic
COR Cornish
CPE Creoles and Pidgins, English-based (Other)
CPF Creoles and Pidgins, French-based (Other)
CPP Creoles and Pidgins, Portuguese-based (Other)
CRE Cree
CRP Creoles and Pidgins (Other)
CUS Cushitic (Other)
CZE Czech
DAK Dakota
DAN Danish
DEL Delaware
DIN Dinka
DOI Dogri
DRA Dravidian (Other)
DUA Duala
DUM Dutch, Middle (ca. 1050-1350)
DUT Dutch
DYU Dyula
EFI Efik
EGY Egyptian
EKA Ekajuk
ELX Elamite
ENG English
ENM English, Middle (1100-1500)
ESK Eskimo
ESP Esperanto
EST Estonian
ETH Ethiopic
EWE Ewe
EWO Ewondo
FAN Fang
FAR Faroese
FAT Fanti
FIJ Fijian
FIN Finnish
FIU Finno-Ugrian (Other)
FON Fon
FRE French
FRI Friesian
FRM French, Middle (ca. 1400-1600)
FRO French, Old (ca. 842-1400)
FUL Fula
GAA Ga
GAE Gaelic (Scots)
GAG Gallegan
GAL Oromo
GAY Gayo
GEM Germanic (Other)
GEO Georgian
GER German
GIL Gilbertese
GMH German, Middle High (ca. 1050-1500)
GOH German, Old High (ca. 750-1050)
GON Gondi
GOT Gothic
GRB Grebo
GRC Greek, Ancient (to 1453)
GRE Greek, Modern (1453- )
GUA Guarani
GUJ Gujarati
HAI Haida
HAU Hausa
HAW Hawaiian
HEB Hebrew
HER Herero
HIL Hiligaynon
HIM Himachali
HIN Hindi
HMO Hiri Motu
HUN Hungarian
HUP Hupa
IBA Iban
IBO Igbo
ICE Icelandic
IJO Ijo
ILO Iloko
INC Indic (Other)
IND Indonesian
INE Indo-European (Other)
INT Interlingua (International Auxiliary Language Association)
IRA Iranian (Other)
IRI Irish
IRO Iroquoian languages
ITA Italian
JAV Javanese
JPN Japanese
JPR Judeo-Persian
JRB Judeo-Arabic
KAA Kara-Kalpak
KAB Kabyle
KAC Kachin
KAM Kamba
KAN Kannada
KAR Karen
KAS Kashmiri
KAU Kanuri
KAW Kawi
KAZ Kazakh
KHA Khasi
KHI Khoisan (Other)
KHO Khotanese
KIK Kikuyu
KIN Kinyarwanda
KIR Kirghiz
KOK Konkani
KON Kongo
KOR Korean
KPE Kpelle
KRO Kru
KRU Kurukh
KUA Kuanyama
KUR Kurdish
KUS Kusaie
KUT Kutenai
LAD Ladino
LAH Lahnda
LAM Lamba
LAN Langue d'oc (post-1500)
LAO Lao
LAP Lapp
LAT Latin
LAV Latvian
LIN Lingala
LIT Lithuanian
LOL Mongo
LOZ Lozi
LUB Luba-Katanga
LUG Ganda
LUI Luiseno
LUN Lunda
LUO Luo (Kenya and Tanzania)
MAC Macedonian
MAD Madurese
MAG Magahi
MAH Marshall
MAI Maithili
MAK Makasar
MAL Malayalam
MAN Mandingo
MAO Maori
MAP Austronesian (Other)
MAR Marathi
MAS Masai
MAX Manx
MAY Malay
MEN Mende
MIC Micmac
MIN Minangkabau
MIS Miscellaneous (Other)
MKH Mon-Khmer (Other)
MLA Malagasy
MLT Maltese
MNI Manipuri
MNO Manobo languages
MOH Mohawk
MOL Moldavian
MON Mongolian
MOS Mossi
MUL Multiple languages
MUN Munda (Other)
MUS Creek
MWR Marwari
MYN Mayan languages
NAH Aztec
NAI North American Indian (Other)
NAV Navajo
NDE Ndebele (Zimbabwe)
NDO Ndonga
NEP Nepali
NEW Newari
NIC Niger-Kordofanian (Other)
NIU Niuean
NOR Norwegian
NSO Northern Sotho
NUB Nubian languages
NYA Nyanja
NYM Nyamwezi
NYN Nyankole
NYO Nyoro
NZI Nzima
OJI Ojibwa
ORI Oriya
OSA Osage
OSS Ossetic
OTA Turkish, Ottoman
OTO Otomian languages
PAA Papuan-Australian (Other)
PAG Pangasinan
PAL Pahlavi
PAM Pampanga
PAN Panjabi
PAP Papiamento
PAU Palauan
PEO Old Persian (ca. 600-400 B.C.)
PER Persian
PLI Pali
POL Polish
PON Ponape
POR Portuguese
PRA Prakrit languages
PRO Provencal, Old (to 1500)
PUS Pushto
QUE Quechua
RAJ Rajasthani
RAR Rarotongan
ROA Romance (Other)
ROH Raeto-Romance
ROM Romany
RUM Romanian
RUN Rundi
RUS Russian
SAD Sandawe
SAG Sango
SAI South American Indian (Other)
SAL Salishan languages
SAM Samaritan Aramaic
SAN Sanskrit
SAO Samoan
SCC Serbo-Croatian (Cyrillic)
SCO Scots
SCR Serbo-Croatian (Roman)
SEL Selkup
SEM Semitic (Other)
SHN Shan
SHO Shona
SID Sidamo
SIO Siouan languages
SIT Sino-Tibetan (Other)
SLA Slavic (Other)
SLO Slovak
SLV Slovenian
SND Sindhi
SNH Sinhalese
SOM Somali
SON Songhai
SPA Spanish
SRR Serer
SSO Sotho
SUK Sukuma
SUN Sundanese
SUS Susu
SUX Sumerian
SWA Swahili
SWZ Swazi
SYR Syriac
TAG Tagalog
TAH Tahitian
TAJ Tajik
TAM Tamil
TAR Tatar
TEL Telugu
TEM Timne
TER Tereno
THA Thai
TIB Tibetan
TIG Tigre
TIR Tigrinya
TIV Tivi
TLI Tlingit
TOG Tonga (Nyasa)
TON Tonga (Tonga Islands)
TRU Truk
TSI Tsimshian
TSO Tsonga
TSW Tswana
TUK Turkmen
TUM Tumbuka
TUR Turkish
TUT Altaic (Other)
TWI Twi
UGA Ugaritic
UIG Uighur
UKR Ukrainian
UMB Umbundu
UND Undetermined
URD Urdu
UZB Uzbek
VAI Vai
VEN Venda
VIE Vietnamese
VOT Votic
WAK Wakashan languages
WAL Walamo
WAR Waray
WAS Washo
WEL Welsh
WEN Sorbian languages
WOL Wolof
XHO Xhosa
YAO Yao
YAP Yap
YID Yiddish
YOR Yoruba
ZAP Zapotec
ZEN Zenaga
ZUL Zulu
ZUN Zuni




Copyright © 1997 sunrise@rmit.edu.au http://www.srl.rmit.edu.au/sunrise/webDIY/metadata/index.shtml