RMIT | Faculty of Art, Design and Communication | Department of Visual Communication | Sunrise Research Laboratory

MetaData - Quick Tutorial

This document gives a brief introduction to metadata and some of its applications. It is at an introductory level, and assumes a basic knowledge of HTML. It demonstrates how to manually include metadata in HTML.

Definition
Application
Generation Techniques
The DublinCore Elements
Further Resources

Definition

MetaData is a way of formally including summary information about various aspects of electronic documents. In general, it encompasses any data that describes such aspects of a document as content, quality, field of knowledge, censorship details, copyright information and format.

Examples of metadata in the non-internet world are:

Library catalogue systems
Film/Book/Music reviews
Indexes and Contents tables in books and journals
Fly pages in books, where details such as publisher, ISBN are provided

Application

When a search engine is used to look for something on the internet, it does not look at actual documents. It looks in one or more big databases that have been compiled automatically by software robots that 'crawl' the web. These robots use various techniques to access documents at a site, but basically they all look for keywords in a document's contents to decide how the reference to the document should be categorised.

There are some big problems with these techniques. Two of them are addressed by the addition of metadata: bandwidth and effectiveness.

To generate these summaries, the document has to be searched entirely, which is one of the biggest bandwidth killers on the internet today. By placing metadata in a known place (near the top of the document), robots need only read in a few lines for each document to generate their database entries.
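As a rough illustration of the bandwidth point, the sketch below collects META tags from the head of a document and stops reading as soon as the head closes, so the body is never examined. It uses Python's standard `html.parser` module. The class name `HeadMetaReader`, the function `read_head_metadata` and the sample document are assumptions made for this example and do not describe any particular robot.

```python
from html.parser import HTMLParser


class HeadMetaReader(HTMLParser):
    """Collect META name/content pairs, ignoring everything after </head>."""

    def __init__(self):
        super().__init__()
        self.entries = []      # (name, content) pairs in document order
        self.done = False      # set once the closing head tag has been seen

    def handle_starttag(self, tag, attrs):
        if self.done or tag != "meta":
            return
        attributes = dict(attrs)   # the parser lower-cases attribute names
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.entries.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.done = True


def read_head_metadata(lines):
    """Feed lines to the parser one at a time and stop at the end of the head."""
    reader = HeadMetaReader()
    for line in lines:
        reader.feed(line)
        if reader.done:
            break              # the rest of the document is never read
    return reader.entries


# Hypothetical document, split into lines as a robot might receive it.
SAMPLE = [
    "<HTML><HEAD>\n",
    '<META NAME="DC.TITLE" CONTENT="Metadata Quick Tutorial">\n',
    '<META NAME="DC.LANGUAGE" CONTENT="ENG">\n',
    "</HEAD>\n",
    "<BODY>\n",
    '<META NAME="DC.TITLE" CONTENT="ignored, outside the head">\n',
    "</BODY></HTML>\n",
]

print(read_head_metadata(SAMPLE))
```

Only the first four lines of the sample are consumed; the tag placed in the body is never seen.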

Current methods do not provide useful databases. There are a couple of reasons for this. The first is language: an unsophisticated user will often not find the information they are looking for because they don't use the same language that the crawlers and databases use. (These users would also have the same problems with library catalogues.) At the other end of the scale are search enquiries that yield hundreds of thousands of 'hits'. This is again largely a problem of user sophistication: a novice user interested in 'computers' will have an impossible amount of information to wade through to find something of use to them.

The addition of metadata will overcome a lot of these types of problems by providing agreed vocabularies and thesauri. The other problem that metadata will help to overcome is the quality of information required. Users will be able to specify {pre-school, primary-school, high-school, university, ...} to determine what level of information they require. Further to this, users may eventually be able to determine the type of resource at a given level, so {reference, general, historic, current-news, tutorial, review, ...} may also be specified. The addition of metadata will allow much more interactive searches that will assist users without the required prior knowledge to perform effective searches.

The interactive process will set the user on the right knowledge tree and may look something like:

Field of knowledge is first established with a general thesaurus. A user may enter 'matrix' and the search engine will come back with the response: 'This database has entries for "matrix" in the areas of {science,general}'
User selects 'science'. Engine then switches to the science thesaurus and responds with 'The term "matrix,science" has the following senses: mathematics: data object, metallurgy: forge and medicine: womb'
The user selects mathematics which is responded to with 'Select level required from {high-school, university}'
After high-school is selected, the reply may be 'This database has entries for {definition,tutorial,application} select term or send the request "ISO-S001-2-09876" to the following search engines....'
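The narrowing steps above can be pictured as a walk down a tree of choices. The following is only a sketch under that assumption: the nested dictionary `TREE` and the function `options` are invented for illustration, and the branches the text says nothing about (for example the entries available at university level) are left empty rather than guessed.

```python
# Hypothetical knowledge tree for the term "matrix", built from the steps above.
# Empty branches mean "not described in the text", not "no entries exist".
TREE = {
    "matrix": {
        "science": {
            "mathematics": {
                "high-school": ["definition", "tutorial", "application"],
                "university": [],
            },
            "metallurgy": {},
            "medicine": {},
        },
        "general": {},
    }
}


def options(tree, path):
    """Return the sorted choices available after following `path` through the tree."""
    node = tree
    for choice in path:
        node = node[choice]    # raises KeyError for a choice that was never offered
    return sorted(node)


print(options(TREE, ["matrix"]))
print(options(TREE, ["matrix", "science"]))
print(options(TREE, ["matrix", "science", "mathematics"]))
print(options(TREE, ["matrix", "science", "mathematics", "high-school"]))
```

Each call corresponds to one response from the search engine: the user's selections so far form the path, and the returned list is what the engine would offer next.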

Of course someone who knows their way around a bit more can simply provide 'field=mathematics,keyword=matrix,level=secondary,infotype=tutorial' as their search term.
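A search term in this form is easy to take apart mechanically. The sketch below splits it into constraints and filters a small list of records. The functions `parse_query` and `matches` and the sample `RECORDS` are hypothetical; only the field names come from the search term above.

```python
def parse_query(term):
    """Split 'key=value,key=value' into a dictionary of search constraints."""
    constraints = {}
    for part in term.split(","):
        key, _, value = part.partition("=")
        constraints[key.strip()] = value.strip()
    return constraints


def matches(record, constraints):
    """A record matches when it agrees with every constraint in the query."""
    return all(record.get(key) == value for key, value in constraints.items())


# Hypothetical database entries, invented for this example.
RECORDS = [
    {"field": "mathematics", "keyword": "matrix", "level": "secondary", "infotype": "tutorial"},
    {"field": "mathematics", "keyword": "matrix", "level": "university", "infotype": "reference"},
    {"field": "medicine", "keyword": "matrix", "level": "secondary", "infotype": "tutorial"},
]

query = parse_query("field=mathematics,keyword=matrix,level=secondary,infotype=tutorial")
hits = [record for record in RECORDS if matches(record, query)]
print(query)
print(len(hits))
```

Exact string comparison is used here for simplicity; a real engine would presumably map terms through the agreed thesauri first.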

Generation

Just as a movie can have several reviews, a document can have several different metadata entries that describe it. They can live inside the document, or they can exist externally in someone's database.

The entries may be provided by the author, generated automatically by software (though this defeats the purpose somewhat), or supplied by a specialist third party. The third party may be a software form you fill out for a particular database, or it may be a group of specialists in your field to whom you send your document for classification (along the lines of journal submissions).

The latter option is perhaps the best, in the sense that these specialist groups already have their own rich classification systems and respect from their peers - for example, the American Mathematical Society has a primary and secondary AMS classification that is known and used by many mathematicians and the journals they read and write for. While an AMS classification number as metadata wouldn't induce the same level of respect as a refereed journal, it provides a useful amount of confidence in a document's suitability.

The problem with this approach is that it takes someone with a good level of experience to successfully classify specialist material. These people usually expect healthy remuneration for their expertise and time. A good trade-off is for authors to provide their own metadata according to a specific scheme. As an example of this, the 'Dublin-Core' metadata descriptors are now briefly explained.

Dublin-Core

The DublinCore metadata for this document may look something like:

<HTML>
<HEAD>
<TITLE>Metadata Quick Tutorial</TITLE>
<META NAME="DC.CREATOR" 
      CONTENT="ANDY WHITE">
<META NAME="DC.SUBJECT" 
      CONTENT="Metadata">
<META NAME="DC.FORM" 
      CONTENT="text/html">
<META NAME="DC.TITLE" 
      CONTENT="Metadata Quick Tutorial">
<META NAME="DC.DESCRIPTION" 
      CONTENT="A brief introduction to metadata. 
      Description of DublinCore elements.
      Example Usage."> 
<META NAME="DC.SUBJECT"
      CONTENT="metadata,dublincore,dc.,dublin core,
      html, crawlers">
<META NAME="DC.DATE"
      CONTENT="19970415">
<META NAME="DC.TYPE"
      CONTENT="TechReport,UnrefereedArticle,Misc">
<META NAME="DC.IDENTIFIER"
      CONTENT="http://www.srl.rmit.edu.au/???">
<META NAME="DC.LANGUAGE"
      CONTENT="ENG">
<META NAME="DC.SOURCE"
      CONTENT="http://purl.org/metadata/dublin_core">
<META NAME="DC.SOURCE"
      CONTENT="http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html">
<META NAME="DC.SOURCE"
      CONTENT="http://www.sil.org/sgml/nisoLang3-1994.html">
<META NAME="DC.PUBLISHER"
      CONTENT="Sunrise Research Laboratory: http://www.srl.rmit.edu.au/">
<META NAME="DC.CONTRIBUTORS"
      CONTENT="liddy@rmit.edu.au, jonathan@rmit.edu.au">
<META NAME="DC.RELATION"
      CONTENT="????">
<META NAME="DC.RIGHTS"
      CONTENT="http://www.srl.rmit.edu.au/copyright.html">
</HEAD>
<BODY>

where each meta tag may be included zero or more times.
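Because an element may appear any number of times, a mapping from element name to a list of values is a natural way to hold the metadata before writing it out. The following sketch generates the META lines from such a mapping. The function name `render_meta` and the sample values are assumptions for illustration; `html.escape` is used so that quotes inside a value cannot break the CONTENT attribute.

```python
from html import escape


def render_meta(prefix, elements):
    """Return one META line per value; an element with no values emits nothing."""
    lines = []
    for name, values in elements.items():
        for value in values:
            lines.append(
                '<META NAME="{}.{}" CONTENT="{}">'.format(
                    prefix, name.upper(), escape(value, quote=True)
                )
            )
    return lines


elements = {
    "title": ["Metadata Quick Tutorial"],
    "source": [
        "http://purl.org/metadata/dublin_core",
        "http://www.sil.org/sgml/nisoLang3-1994.html",
    ],
    "coverage": [],     # zero occurrences: nothing is written for this element
}

for line in render_meta("DC", elements):
    print(line)
```

The sample produces three lines: one TITLE, two SOURCE and no COVERAGE.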

Dublin-Core currently has 15 recommended tags and uses the DC. prefix to distinguish it from other schema. (Definitions as per w3. with my additions in italics)

`TITLE`

The name given to the resource by the CREATOR or PUBLISHER.

`CREATOR`

The person(s) or organisation(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

`SUBJECT`

The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

The highest level available should be used - as computers will perform the task of adding the more general classification schemes. This element would likely be provided to the author after filling out a classification form - or authors may look up agreed thesauri when they become available.

<META NAME="DC.SUBJECT" CONTENT="MATHEMATICS,FLUIDS,PERTURBATION">
<META NAME="DC.SUBJECT" CONTENT="Non-Newtonian Fluid Flow, Bingham Plastics, Rheology, Yield-Stress Fluids">

(The following are examples of specialist schema that may be used and are not part of the Dublin-Core.)

<META NAME="AMS.PRIMARY.SUBJECT" CONTENT="65H05">
<META NAME="AMS.SECONDARY.SUBJECT" CONTENT="76A05">
<META NAME="DEWEY.SUBJECT" CONTENT="532">
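Since every name starts with a prefix that identifies its schema, software reading the tags can sort them by schema before interpreting them. A minimal sketch, assuming the prefix is everything before the first dot; the helper names `split_name` and `group_by_scheme` are made up for this example.

```python
def split_name(meta_name):
    """Split 'SCHEME.ELEMENT' at the first dot into (scheme, element)."""
    scheme, _, element = meta_name.partition(".")
    return scheme, element


def group_by_scheme(pairs):
    """Group (name, content) pairs into {scheme: [(element, content), ...]}."""
    grouped = {}
    for name, content in pairs:
        scheme, element = split_name(name)
        grouped.setdefault(scheme, []).append((element, content))
    return grouped


pairs = [
    ("DC.SUBJECT", "MATHEMATICS,FLUIDS,PERTURBATION"),
    ("AMS.PRIMARY.SUBJECT", "65H05"),
    ("AMS.SECONDARY.SUBJECT", "76A05"),
    ("DEWEY.SUBJECT", "532"),
]
print(group_by_scheme(pairs))
```

A reader that only understands Dublin-Core can then take the "DC" group and ignore the rest.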

`DESCRIPTION`

A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

Keep in mind that future browsers and style-sheets will probably be able to pull this field out by itself as an abstract that people read before deciding to download the full document.

`PUBLISHER`

The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

This is an ambiguous definition - an article written by an individual that he/she places on the internet via a university server would specify the university as the publisher, but what if the individual placed the article on a commercial server? Although the document resides on www.someserver.com, the individual presumably pays for it to be there and may rightly feel that they do not need to name their service provider as the publisher.

`CONTRIBUTORS`

Person(s) or organisation(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and conveners).

`DATE`

The date the resource was made available in its present form. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schema are possible, but if used, they should be identified in an unambiguous manner.
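The YYYYMMDD form is simple to produce and to check with Python's standard `datetime` module. This is a sketch only; the function names `to_dc_date` and `from_dc_date` are invented for the example.

```python
from datetime import date, datetime


def to_dc_date(day):
    """Format a date as the 8 digit YYYYMMDD string."""
    return day.strftime("%Y%m%d")


def from_dc_date(text):
    """Parse a YYYYMMDD string; raises ValueError for an impossible date."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError("expected 8 digits in the form YYYYMMDD: %r" % text)
    return datetime.strptime(text, "%Y%m%d").date()


print(to_dc_date(date(1996, 12, 3)))      # 19961203
print(from_dc_date("19961203"))           # 1996-12-03
```

Parsing the value back is a cheap way to catch day and month written in the wrong order, since a month greater than 12 is rejected.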

`TYPE`

The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that TYPE will be chosen from an enumerated list of types. A preliminary set of such types can be found at http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html (Reproduced for convenience)
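Choosing TYPE from an enumerated list means a value can be checked mechanically. The sketch below assumes a comma-separated value and compares names exactly, including case. `KNOWN_TYPES` holds only a handful of names from the preliminary list, and `check_types` is a hypothetical helper, so "unknown" here means "not in this subset".

```python
# A subset of the preliminary type list; the full list is longer.
KNOWN_TYPES = {"Article", "Book", "Misc", "Preprint", "TechReport", "UnrefereedArticle"}


def check_types(content):
    """Split a comma-separated TYPE value and separate known from unknown names."""
    names = [name.strip() for name in content.split(",") if name.strip()]
    known = [name for name in names if name in KNOWN_TYPES]
    unknown = [name for name in names if name not in KNOWN_TYPES]
    return known, unknown


print(check_types("TechReport,UnrefereedArticle,Misc"))
print(check_types("TechReport, unrefereedarticle"))
```

Because the comparison is case-sensitive, a name with different capitalisation is reported as unknown instead of being silently accepted.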

`FORM`

The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with TYPE, FORM will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.
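For files on a web server, a media type for FORM can often be guessed from the file extension. A small sketch using Python's standard `mimetypes` module; the function name `guess_form`, the sample file names and the fallback value are assumptions for this example, and a guess based on the extension alone can of course be wrong.

```python
import mimetypes


def guess_form(filename, default="application/octet-stream"):
    """Guess a media type for DC.FORM from the file extension alone."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type or default


print(guess_form("tutorial.html"))
print(guess_form("figure.jpeg"))
```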

`IDENTIFIER`

String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.

`SOURCE`

The work, either print or electronic, from which this resource is derived, if applicable. For example, an HTML encoding of a Shakespearian sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.

`LANGUAGE`

Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages - see http://www.sil.org/sgml/nisoLang3-1994.html. (Reproduced for convenience)
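A three character code can be validated with a simple table lookup. The sketch below copies a few rows from the code table reproduced in this document; the dictionary name `LANGUAGE_CODES` and the function `language_name` are hypothetical, and a real check would load the whole table.

```python
# A few rows copied from the code table; a real check would load all of them.
LANGUAGE_CODES = {
    "ENG": "English",
    "FRE": "French",
    "GER": "German",
    "JPN": "Japanese",
    "MUL": "Multiple languages",
    "UND": "Undetermined",
}


def language_name(code):
    """Look up a three character code, ignoring case and surrounding spaces."""
    return LANGUAGE_CODES.get(code.strip().upper())


print(language_name("eng"))
print(language_name("XXX"))
```

An unrecognised code returns `None`, which a caller could treat as a prompt to fall back to UND (Undetermined).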

`RELATION`

Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

`COVERAGE`

The spatial locations and temporal durations characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

`RIGHTS`

The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.

Advertisement: A commercial advertisement for a product or service.
Article: A peer-reviewed, refereed article from a journal.
Bibliography: A bibliography of other resources.
Book: A complete book, not formed from separate papers.
Booklet: A work that is printed and bound but without a named publisher or sponsoring institution.
Collection: A book produced from a collection of separate papers.
CourseMaterial: Syllabus, timetable, etc for a course.
Dataset: A set of data of some sort.
HonoursThesis: A university Honours thesis.
Image: A picture of some sort.
InBook: A part of a book, which may be a chapter and/or range of pages.
InCollection: A single paper or article from a published collection.
InProceedings: A single paper from a published workshop or conference proceedings.
Journal: An entire issue of a refereed learned journal.
Magazine: An entire issue of an unrefereed journal or magazine.
Manual: An operations manual for a product.
MastersThesis: A university Masters thesis.
MessageOnModeratedMailingList: The resource is a message on a mailing list which is moderated.
MessageOnUnmoderatedMailingList: The resource is a message on a mailing list which is not moderated.
Misc: Work of another or undetermined type. This is the default scheme value if the scheme is not explicitly stated.
Music: A piece of music or a score.
OrganisationInfo: Some sort of information about an organisation or group (eg: A library homepage on the web).
PhDThesis: A university Doctoral thesis.
PersonalInfo: Some sort of information about an individual (eg: A person's homepage).
Poem: A piece of poetry.
PostingToModeratedNewsgroup: The resource is a message posted to a USENET newsgroup which is moderated.
PostingToUnmoderatedNewsgroup: The resource is a message posted to a USENET newsgroup which is not moderated.
Preprint: Pre-publication of a research article.
Proceedings: A whole published workshop or conference proceedings.
ResearchPaper: A piece of research work.
Service: An online service of some description.
TechReport: An internal university or research organisation technical report.
Unpublished: A document with an author and title, but not formally published.
UnrefereedArticle: An unrefereed article from a journal, magazine or newspaper.
Video: A video of some sort.

Code	Language
ACE	Achinese
ACH	Acoli
ADA	Adangme
AFA	Afro-Asiatic (Other)
AFH	Afrihili (Artificial language)
AFR	Afrikaans
AJM	Aljamia
AKK	Akkadian
ALB	Albanian
ALE	Aleut
ALG	Algonquian languages
AMH	Amharic
ANG	English, Old (ca. 450-1100)
APA	Apache languages
ARA	Arabic
ARC	Aramaic
ARM	Armenian
ARN	Araucanian
ARP	Arapaho
ART	Artificial (Other)
ARW	Arawak
ASM	Assamese
ATH	Athapascan languages
AVA	Avaric
AVE	Avestan
AWA	Awadhi
AYM	Aymara
AZE	Azerbaijani
BAD	Banda
BAI	Bamileke languages
BAK	Bashkir
BAL	Baluchi
BAM	Bambara
BAN	Balinese
BAQ	Basque
BAS	Basa
BAT	Baltic (Other)
BEJ	Beja
BEL	Byelorussian
BEM	Bemba
BEN	Bengali
BER	Berber languages
BHO	Bhojpuri
BIK	Bikol
BIN	Bini
BLA	Siksika
BRA	Braj
BRE	Breton
BUG	Buginese
BUL	Bulgarian
BUR	Burmese
CAD	Caddo
CAI	Central American Indian (Other)
CAM	Khmer
CAR	Carib
CAT	Catalan
CAU	Caucasian (Other)
CEB	Cebuano
CEL	Celtic languages
CHA	Chamorro
CHB	Chibcha
CHE	Chechen
CHG	Chagatai
CHI	Chinese
CHN	Chinook jargon
CHO	Choctaw
CHR	Cherokee
CHU	Church Slavic
CHV	Chuvash
CHY	Cheyenne
COP	Coptic
COR	Cornish
CPE	Creoles and Pidgins, English-based (Other)
CPF	Creoles and Pidgins, French-based (Other)
CPP	Creoles and Pidgins, Portuguese-based (Other)
CRE	Cree
CRP	Creoles and Pidgins (Other)
CUS	Cushitic (Other)
CZE	Czech
DAK	Dakota
DAN	Danish
DEL	Delaware
DIN	Dinka
DOI	Dogri
DRA	Dravidian (Other)
DUA	Duala
DUM	Dutch, Middle (ca. 1050-1350)
DUT	Dutch
DYU	Dyula
EFI	Efik
EGY	Egyptian
EKA	Ekajuk
ELX	Elamite
ENG	English
ENM	English, Middle (1100-1500)
ESK	Eskimo
ESP	Esperanto
EST	Estonian
ETH	Ethiopic
EWE	Ewe
EWO	Ewondo
FAN	Fang
FAR	Faroese
FAT	Fanti
FIJ	Fijian
FIN	Finnish
FIU	Finno-Ugrian (Other)
FON	Fon
FRE	French
FRI	Friesian
FRM	French, Middle (ca. 1400-1600)
FRO	French, Old (ca. 842-1400)
FUL	Fula
GAA	Gã
GAE	Gaelic (Scots)
GAG	Gallegan
GAL	Oromo
GAY	Gayo
GEM	Germanic (Other)
GEO	Georgian
GER	German
GIL	Gilbertese
GMH	German, Middle High (ca. 1050-1500)
GOH	German, Old High (ca. 750-1050)
GON	Gondi
GOT	Gothic
GRB	Grebo
GRC	Greek, Ancient (to 1453)
GRE	Greek, Modern (1453- )
GUA	Guarani
GUJ	Gujarati
HAI	Haida
HAU	Hausa
HAW	Hawaiian
HEB	Hebrew
HER	Herero
HIL	Hiligaynon
HIM	Himachali
HIN	Hindi
HMO	Hiri Motu
HUN	Hungarian
HUP	Hupa
IBA	Iban
IBO	Igbo
ICE	Icelandic
IJO	Ijo
ILO	Iloko
INC	Indic (Other)
IND	Indonesian
INE	Indo-European (Other)
INT	Interlingua (International Auxiliary Language Association)
IRA	Iranian (Other)
IRI	Irish
IRO	Iroquoian languages
ITA	Italian
JAV	Javanese
JPN	Japanese
JPR	Judeo-Persian
JRB	Judeo-Arabic
KAA	Kara-Kalpak
KAB	Kabyle
KAC	Kachin
KAM	Kamba
KAN	Kannada
KAR	Karen
KAS	Kashmiri
KAU	Kanuri
KAW	Kawi
KAZ	Kazakh
KHA	Khasi
KHI	Khoisan (Other)
KHO	Khotanese
KIK	Kikuyu
KIN	Kinyarwanda
KIR	Kirghiz
KOK	Konkani
KON	Kongo
KOR	Korean
KPE	Kpelle
KRO	Kru
KRU	Kurukh
KUA	Kuanyama
KUR	Kurdish
KUS	Kusaie
KUT	Kutenai
LAD	Ladino
LAH	Lahnda
LAM	Lamba
LAN	Langue d'oc (post-1500)
LAO	Lao
LAP	Lapp
LAT	Latin
LAV	Latvian
LIN	Lingala
LIT	Lithuanian
LOL	Mongo
LOZ	Lozi
LUB	Luba-Katanga
LUG	Ganda
LUI	Luiseno
LUN	Lunda
LUO	Luo (Kenya and Tanzania)
MAC	Macedonian
MAD	Madurese
MAG	Magahi
MAH	Marshall
MAI	Maithili
MAK	Makasar
MAL	Malayalam
MAN	Mandingo
MAO	Maori
MAP	Austronesian (Other)
MAR	Marathi
MAS	Masai
MAX	Manx
MAY	Malay
MEN	Mende
MIC	Micmac
MIN	Minangkabau
MIS	Miscellaneous (Other)
MKH	Mon-Khmer (Other)
MLA	Malagasy
MLT	Maltese
MNI	Manipuri
MNO	Manobo languages
MOH	Mohawk
MOL	Moldavian
MON	Mongolian
MOS	Mossi
MUL	Multiple languages
MUN	Munda (Other)
MUS	Creek
MWR	Marwari
MYN	Mayan languages
NAH	Aztec
NAI	North American Indian (Other)
NAV	Navajo
NDE	Ndebele (Zimbabwe)
NDO	Ndonga
NEP	Nepali
NEW	Newari
NIC	Niger-Kordofanian (Other)
NIU	Niuean
NOR	Norwegian
NSO	Northern Sotho
NUB	Nubian languages
NYA	Nyanja
NYM	Nyamwezi
NYN	Nyankole
NYO	Nyoro
NZI	Nzima
OJI	Ojibwa
ORI	Oriya
OSA	Osage
OSS	Ossetic
OTA	Turkish, Ottoman
OTO	Otomian languages
PAA	Papuan-Australian (Other)
PAG	Pangasinan
PAL	Pahlavi
PAM	Pampanga
PAN	Panjabi
PAP	Papiamento
PAU	Palauan
PEO	Old Persian (ca. 600-400 B.C.)
PER	Persian
PLI	Pali
POL	Polish
PON	Ponape
POR	Portuguese
PRA	Prakrit languages
PRO	Provencal, Old (to 1500)
PUS	Pushto
QUE	Quechua
RAJ	Rajasthani
RAR	Rarotongan
ROA	Romance (Other)
ROH	Raeto-Romance
ROM	Romany
RUM	Romanian
RUN	Rundi
RUS	Russian
SAD	Sandawe
SAG	Sango
SAI	South American Indian (Other)
SAL	Salishan languages
SAM	Samaritan Aramaic
SAN	Sanskrit
SAO	Samoan
SCC	Serbo-Croatian (Cyrillic)
SCO	Scots
SCR	Serbo-Croatian (Roman)
SEL	Selkup
SEM	Semitic (Other)
SHN	Shan
SHO	Shona
SID	Sidamo
SIO	Siouan languages
SIT	Sino-Tibetan (Other)
SLA	Slavic (Other)
SLO	Slovak
SLV	Slovenian
SND	Sindhi
SNH	Sinhalese
SOM	Somali
SON	Songhai
SPA	Spanish
SRR	Serer
SSO	Sotho
SUK	Sukuma
SUN	Sundanese
SUS	Susu
SUX	Sumerian
SWA	Swahili
SWZ	Swazi
SYR	Syriac
TAG	Tagalog
TAH	Tahitian
TAJ	Tajik
TAM	Tamil
TAR	Tatar
TEL	Telugu
TEM	Timne
TER	Tereno
THA	Thai
TIB	Tibetan
TIG	Tigre
TIR	Tigrinya
TIV	Tivi
TLI	Tlingit
TOG	Tonga (Nyasa)
TON	Tonga (Tonga Islands)
TRU	Truk
TSI	Tsimshian
TSO	Tsonga
TSW	Tswana
TUK	Turkmen
TUM	Tumbuka
TUR	Turkish
TUT	Altaic (Other)
TWI	Twi
UGA	Ugaritic
UIG	Uighur
UKR	Ukrainian
UMB	Umbundu
UND	Undetermined
URD	Urdu
UZB	Uzbek
VAI	Vai
VEN	Venda
VIE	Vietnamese
VOT	Votic
WAK	Wakashan languages
WAL	Walamo
WAR	Waray
WAS	Washo
WEL	Welsh
WEN	Sorbian languages
WOL	Wolof
XHO	Xhosa
YAO	Yao
YAP	Yap
YID	Yiddish
YOR	Yoruba
ZAP	Zapotec
ZEN	Zenaga
ZUL	Zulu
ZUN	Zuni
