This document gives a brief introduction to metadata and some of its applications. It is pitched at an introductory level and assumes a basic knowledge of HTML. It demonstrates how to manually include metadata in HTML.
Metadata is a way of formally including summary information about various aspects of electronic documents. In general, it encompasses any data that describes such aspects of a document as content, quality, field of knowledge, censorship details, copyright information, format, etc.
Examples of metadata in the non-internet world are:
When a search engine is used to look for something on the internet, it does not look at the actual documents - it looks in one or more big databases that have been compiled automatically by software robots that 'crawl' the web. These robots use various techniques to access documents at a site, but basically they all look for keywords in a document's contents to decide how the reference to the document should be categorised.
There are some big problems with these techniques. Two of them, bandwidth and effectiveness, are addressed by the addition of metadata.
To generate these summaries, the entire document has to be read, which is one of the biggest bandwidth killers on the internet today. By placing metadata in a known place (near the top of the document), robots need only read in a few lines of each document to generate their database entries.
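The idea of reading only the top of a document can be sketched in Python. This is a minimal illustration, assuming the metadata is carried in META tags inside the HEAD, as in the examples later in this document. The names `HeadMetaReader` and `read_head_metadata` and the sample page are hypothetical, and a real robot would also have to cut the network transfer short, which is not shown here.

```python
from html.parser import HTMLParser


class HeadMetaReader(HTMLParser):
    """Collects META name/content pairs and ignores everything after </HEAD>."""

    def __init__(self):
        super().__init__()
        self.entries = []       # (name, content) pairs in document order
        self.head_done = False  # set once the closing HEAD tag has been seen

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us.
        if self.head_done or tag != "meta":
            return
        fields = dict(attrs)
        if "name" in fields and "content" in fields:
            self.entries.append((fields["name"], fields["content"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.head_done = True


def read_head_metadata(lines):
    """Feed the document line by line and stop as soon as the HEAD is closed."""
    reader = HeadMetaReader()
    lines_read = 0
    for line in lines:
        reader.feed(line)
        lines_read += 1
        if reader.head_done:
            break
    return reader.entries, lines_read


# Hypothetical page: a short HEAD followed by a body the robot can skip.
SAMPLE_PAGE = [
    "<HTML> <HEAD> <TITLE>Metadata Quick Tutorial</TITLE>\n",
    '<META NAME="DC.SUBJECT" CONTENT="Metadata">\n',
    '<META NAME="DC.FORM" CONTENT="text/html">\n',
    "</HEAD> <BODY>\n",
    "<P>A long body that the robot never needs to read.</P>\n",
    "<P>More body text.</P>\n",
]

entries, lines_read = read_head_metadata(SAMPLE_PAGE)
print(entries)
print(lines_read, "of", len(SAMPLE_PAGE), "lines read")
```

In this sketch the two body lines are never passed to the parser, which is the saving the paragraph above describes.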
Current methods do not provide useful databases. There are a couple of reasons for this. The first is language: an unsophisticated user will often not find the information they are looking for because they don't use the same language that the crawlers and databases use. (These users would have the same problems with library catalogues.) At the other end of the scale are search enquiries that yield hundreds of thousands of 'hits'. This is again largely a problem of user sophistication - a novice user interested in 'computers' will have an impossible amount of information to wade through to find something of use to them.
The addition of metadata will overcome a lot of these types of problems by providing agreed vocabularies and thesauri. The other problem that metadata will help to overcome is the quality of information required. Users will be able to specify {pre-school, primary-school, high-school, university, ...} to determine what level of information they require. Further to this, users may eventually be able to determine the type of resource at a given level - so {reference, general, historic, current-news, tutorial, review, ...} may also be specified. The addition of metadata will allow much more interactive searches that will assist users without the required prior knowledge to perform effective searches.
The interactive process will set the user on the right knowledge tree and may look something like:
Of course someone who knows their way around a bit more can simply provide 'field=mathematics,keyword=matrix,level=secondary,infotype=tutorial' as their search term.
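As a rough sketch of how such a search term might be handled, the following Python splits the term into field/value pairs and compares them with a couple of database entries. The record layout, the exact-match rule and the function names are assumptions made for illustration only; no particular search engine is being described.

```python
def parse_search_term(term):
    """Split 'key=value,key=value' into a dictionary of lower-cased strings."""
    query = {}
    for part in term.split(","):
        key, _, value = part.partition("=")
        query[key.strip().lower()] = value.strip().lower()
    return query


def matches(record, query):
    """A record matches when every requested field has the requested value."""
    return all(record.get(key, "").lower() == value
               for key, value in query.items())


# Hypothetical database entries; the field names mirror the search term above.
RECORDS = [
    {"title": "Matrices for beginners", "field": "mathematics",
     "keyword": "matrix", "level": "secondary", "infotype": "tutorial"},
    {"title": "Advances in matrix theory", "field": "mathematics",
     "keyword": "matrix", "level": "university", "infotype": "review"},
]

query = parse_search_term(
    "field=mathematics,keyword=matrix,level=secondary,infotype=tutorial")
hits = [record["title"] for record in RECORDS if matches(record, query)]
print(hits)  # ['Matrices for beginners']
```

Only the first entry satisfies all four fields, so the university-level review is filtered out without the user having to wade through it.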
Just as a movie can have several reviews, a document can have several different metadata entries that describe it. They can live inside the document, or they can exist externally in someone's database.
The entries may be provided by the author, generated automatically by software (though this defeats the purpose somewhat), or supplied by a specialist third party. The third party may be a software form you fill out for a particular database, or it may be a group of specialists in your field to whom you send your document for classification (along the lines of journal submissions).
The latter option is perhaps the best, in the sense that these specialist groups already have their own rich classification systems and the respect of their peers - for example, the American Mathematical Society has a primary and secondary AMS classification that is known and used by many mathematicians and by the journals they read and write for. While an AMS classification number as metadata wouldn't command the same level of respect as a refereed journal, it provides a useful amount of confidence in a document's suitability.
The problem with this approach is that it takes someone with a good level of experience to successfully classify specialist material. These people usually expect healthy remuneration for their expertise and time. A good trade-off is for authors to provide their own metadata according to a specific scheme. As an example of this, the 'Dublin-Core' metadata descriptors are now briefly explained.
The Dublin-Core metadata for this document may look something like:
<HTML>
<HEAD>
<TITLE>Metadata Quick Tutorial</TITLE>
<META NAME="DC.CREATOR" CONTENT="ANDY WHITE">
<META NAME="DC.SUBJECT" CONTENT="Metadata">
<META NAME="DC.FORM" CONTENT="text/html">
<META NAME="DC.TITLE" CONTENT="Metadata Quick Tutorial">
<META NAME="DC.DESCRIPTION" CONTENT="A brief introduction to metadata. Description of DublinCore elements. Example Usage.">
<META NAME="DC.DESCRIPTION" CONTENT="metadata,dublincore,dc.,dublin core, html, crawlers">
<META NAME="DC.DATE" CONTENT="19970415">
<META NAME="DC.TYPE" CONTENT="TechReport,UnRefereedArticle,Misc">
<META NAME="DC.IDENTIFIER" CONTENT="http://www.srl.rmit.edu.au/???">
<META NAME="DC.LANGUAGE" CONTENT="ENG">
<META NAME="DC.SOURCE" CONTENT="http://purl.org/metadata/dublin_core">
<META NAME="DC.SOURCE" CONTENT="http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html">
<META NAME="DC.SOURCE" CONTENT="http://www.sil.org/sgml/nisoLang3-1994.html">
<META NAME="DC.PUBLISHER" CONTENT="Sunrise Research Laboratory: http://www.srl.rmit.edu.au/">
<META NAME="DC.CONTRIBUTORS" CONTENT="liddy@rmit.edu.au, jonathan@rmit.edu.au">
<META NAME="DC.RELATION" CONTENT="????">
<META NAME="DC.RIGHTS" CONTENT="http://www.srl.rmit.edu.au/copyright.html">
</HEAD>
<BODY>
where each meta tag may be included zero or more times.
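For a site with many documents, lines like these could be produced by a small script rather than typed by hand. The sketch below is one possible approach in Python, assuming each element is stored as a list of zero or more values. The function name `render_meta_tags` and the dictionary layout are hypothetical and are not part of Dublin-Core.

```python
from html import escape


def render_meta_tags(record, prefix="DC"):
    """Return one META line per value; an element with no values yields no line."""
    lines = []
    for element, values in record.items():
        for value in values:
            lines.append('<META NAME="{}.{}" CONTENT="{}">'.format(
                prefix, element.upper(), escape(value, quote=True)))
    return lines


# Hypothetical input: each element maps to a list of zero or more values.
record = {
    "title": ["Metadata Quick Tutorial"],
    "subject": ["Metadata"],
    "source": [
        "http://purl.org/metadata/dublin_core",
        "http://www.sil.org/sgml/nisoLang3-1994.html",
    ],
    "relation": [],  # zero occurrences: nothing is emitted
}

meta_lines = render_meta_tags(record)
for line in meta_lines:
    print(line)
```

Escaping the CONTENT value matters because a title or description may itself contain a double quote, which would otherwise end the attribute early.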
Dublin-Core currently has 15 recommended tags and uses the DC. prefix to distinguish them from those of other schemas. (Definitions are as per w3, with my additions in italics.)
TITLE
The name given to the resource by the CREATOR or PUBLISHER.
CREATOR
The person(s) or organisation(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.
SUBJECT
The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.
The highest (most specific) level of classification available should be used, as computers will perform the task of adding the more general classifications. This element would likely be provided to the author after filling out a classification form - or authors may look up agreed thesauri when they become available.
<META NAME="DC.SUBJECT" CONTENT="MATHEMATICS,FLUIDS,PERTURBATION">
<META NAME="DC.SUBJECT" CONTENT="Non-Newtonian Fluid Flow, Bingham Plastics, Rheology, Yield-Stress Fluids">
(The following are examples of specialist schemas that may be used; they are not part of the Dublin-Core.)
<META NAME="AMS.PRIMARY.SUBJECT" CONTENT="65H05">
<META NAME="AMS.SECONDARY.SUBJECT" CONTENT="76A05">
<META NAME="DEWEY.SUBJECT" CONTENT="532">
DESCRIPTION
A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.
Keep in mind that future browsers and style-sheets will probably be able to pull this field out by itself as an abstract that people read before deciding to download the full document.
PUBLISHER
The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.
This is an ambiguous definition - an article that an individual places on the internet via a university server would specify the university as the publisher, but what if the individual placed the article on a commercial server? Although the document resides on www.someserver.com, the individual presumably pays for it to be there and may rightly feel that they do not need to name their service provider as the publisher.
CONTRIBUTORS
Person(s) or organisation(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and conveners).
DATE
The date the resource was made available in its present form. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schemas are possible, but if used, they should be identified in an unambiguous manner.
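A date in this form is easy to check mechanically. The following is a minimal Python sketch, assuming only the 8 digit YYYYMMDD form described above; the function name `parse_dc_date` is hypothetical.

```python
from datetime import datetime


def parse_dc_date(text):
    """Return a date for an 8 digit YYYYMMDD string, or None if it is not valid."""
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        # Eight digits, but not a real calendar date.
        return None


print(parse_dc_date("19961203"))  # 1996-12-03
print(parse_dc_date("19961332"))  # None: there is no month 13
print(parse_dc_date("3/12/96"))   # None: not in YYYYMMDD form
```

A check like this catches the most common slip, which is writing the day before the month.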
TYPE
The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that TYPE will be chosen from an enumerated list of types. A preliminary set of such types can be found at http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html (Reproduced for convenience)
FORM
The data representation of the resource, such as text/html, ASCII, PostScript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with TYPE, FORM will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.
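Where the format is an Internet Media Type, a tool could guess the value from the file name. The sketch below uses Python's standard `mimetypes` module. The helper names `guess_form` and `form_tag` and the fallback value are assumptions made for illustration, and guessing from the extension alone is only a convenience, not something Dublin-Core prescribes.

```python
import mimetypes


def guess_form(filename, default="application/octet-stream"):
    """Guess an Internet Media Type from the file extension alone."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type or default


def form_tag(filename):
    """Build a DC.FORM line for a file (hypothetical helper)."""
    return '<META NAME="DC.FORM" CONTENT="{}">'.format(guess_form(filename))


print(form_tag("index.html"))  # <META NAME="DC.FORM" CONTENT="text/html">
print(form_tag("photo.jpg"))
print(form_tag("README"))      # no extension, so the default is used
```

An author should still check the guessed value by eye, since an extension says nothing certain about what a file actually contains.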
IDENTIFIER
String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names, would also be candidates for this element.
SOURCE
The work, either print or electronic, from which this resource is derived, if applicable. For example, an HTML encoding of a Shakespearian sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.
LANGUAGE
Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages - see http://www.sil.org/sgml/nisoLang3-1994.html. (Reproduced for convenience)
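A site tool could check this field against the code list. The following Python sketch uses a small excerpt of the three character codes. It assumes, purely for illustration, that several languages are written as a comma-separated list in one CONTENT value; the function name `check_language` is hypothetical.

```python
# A small excerpt of the three character codes; a full tool would load them all.
LANGUAGE_CODES = {
    "ENG": "English",
    "FRE": "French",
    "GER": "German",
    "JPN": "Japanese",
    "SPA": "Spanish",
}


def check_language(content):
    """Split a DC.LANGUAGE value on commas and sort the codes into known/unknown."""
    known, unknown = [], []
    for code in content.split(","):
        code = code.strip().upper()
        if not code:
            continue
        if code in LANGUAGE_CODES:
            known.append(code)
        else:
            unknown.append(code)
    return known, unknown


print(check_language("ENG"))          # (['ENG'], [])
print(check_language("eng, FRE, EN")) # (['ENG', 'FRE'], ['EN'])
```

Two-letter abbreviations such as EN are reported as unknown here because they are not in the three character list.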
RELATION
Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.
COVERAGE
The spatial locations and temporal durations characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.
RIGHTS
The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.
Other classification schemes are provided for reference. They may be useful in deciding which area of knowledge covers the work you are trying to classify. If it is your job to maintain a large site, you may want to consider collecting resources such as these to develop tools that assist users in classifying their work.
Code | Language |
---|---|
ACE | Achinese |
ACH | Acoli |
ADA | Adangme |
AFA | Afro-Asiatic (Other) |
AFH | Afrihili (Artificial language) |
AFR | Afrikaans |
AJM | Aljamia |
AKK | Akkadian |
ALB | Albanian |
ALE | Aleut |
ALG | Algonquian languages |
AMH | Amharic |
ANG | English, Old (ca. 450-1100) |
APA | Apache languages |
ARA | Arabic |
ARC | Aramaic |
ARM | Armenian |
ARN | Araucanian |
ARP | Arapaho |
ART | Artificial (Other) |
ARW | Arawak |
ASM | Assamese |
ATH | Athapascan languages |
AVA | Avaric |
AVE | Avestan |
AWA | Awadhi |
AYM | Aymara |
AZE | Azerbaijani |
BAD | Banda |
BAI | Bamileke languages |
BAK | Bashkir |
BAL | Baluchi |
BAM | Bambara |
BAN | Balinese |
BAQ | Basque |
BAS | Basa |
BAT | Baltic (Other) |
BEJ | Beja |
BEL | Byelorussian |
BEM | Bemba |
BEN | Bengali |
BER | Berber languages |
BHO | Bhojpuri |
BIK | Bikol |
BIN | Bini |
BLA | Siksika |
BRA | Braj |
BRE | Breton |
BUG | Buginese |
BUL | Bulgarian |
BUR | Burmese |
CAD | Caddo |
CAI | Central American Indian (Other) |
CAM | Khmer |
CAR | Carib |
CAT | Catalan |
CAU | Caucasian (Other) |
CEB | Cebuano |
CEL | Celtic languages |
CHA | Chamorro |
CHB | Chibcha |
CHE | Chechen |
CHG | Chagatai |
CHI | Chinese |
CHN | Chinook jargon |
CHO | Choctaw |
CHR | Cherokee |
CHU | Church Slavic |
CHV | Chuvash |
CHY | Cheyenne |
COP | Coptic |
COR | Cornish |
CPE | Creoles and Pidgins, English-based (Other) |
CPF | Creoles and Pidgins, French-based (Other) |
CPP | Creoles and Pidgins, Portuguese-based (Other) |
CRE | Cree |
CRP | Creoles and Pidgins (Other) |
CUS | Cushitic (Other) |
CZE | Czech |
DAK | Dakota |
DAN | Danish |
DEL | Delaware |
DIN | Dinka |
DOI | Dogri |
DRA | Dravidian (Other) |
DUA | Duala |
DUM | Dutch, Middle (ca. 1050-1350) |
DUT | Dutch |
DYU | Dyula |
EFI | Efik |
EGY | Egyptian |
EKA | Ekajuk |
ELX | Elamite |
ENG | English |
ENM | English, Middle (1100-1500) |
ESK | Eskimo |
ESP | Esperanto |
EST | Estonian |
ETH | Ethiopic |
EWE | Ewe |
EWO | Ewondo |
FAN | Fang |
FAR | Faroese |
FAT | Fanti |
FIJ | Fijian |
FIN | Finnish |
FIU | Finno-Ugrian (Other) |
FON | Fon |
FRE | French |
FRI | Friesian |
FRM | French, Middle (ca. 1400-1600) |
FRO | French, Old (ca. 842-1400) |
FUL | Fula |
GAA | Gã |
GAE | Gaelic (Scots) |
GAG | Gallegan |
GAL | Oromo |
GAY | Gayo |
GEM | Germanic (Other) |
GEO | Georgian |
GER | German |
GIL | Gilbertese |
GMH | German, Middle High (ca. 1050-1500) |
GOH | German, Old High (ca. 750-1050) |
GON | Gondi |
GOT | Gothic |
GRB | Grebo |
GRC | Greek, Ancient (to 1453) |
GRE | Greek, Modern (1453- ) |
GUA | Guarani |
GUJ | Gujarati |
HAI | Haida |
HAU | Hausa |
HAW | Hawaiian |
HEB | Hebrew |
HER | Herero |
HIL | Hiligaynon |
HIM | Himachali |
HIN | Hindi |
HMO | Hiri Motu |
HUN | Hungarian |
HUP | Hupa |
IBA | Iban |
IBO | Igbo |
ICE | Icelandic |
IJO | Ijo |
ILO | Iloko |
INC | Indic (Other) |
IND | Indonesian |
INE | Indo-European (Other) |
INT | Interlingua (International Auxiliary Language Association) |
IRA | Iranian (Other) |
IRI | Irish |
IRO | Iroquoian languages |
ITA | Italian |
JAV | Javanese |
JPN | Japanese |
JPR | Judeo-Persian |
JRB | Judeo-Arabic |
KAA | Kara-Kalpak |
KAB | Kabyle |
KAC | Kachin |
KAM | Kamba |
KAN | Kannada |
KAR | Karen |
KAS | Kashmiri |
KAU | Kanuri |
KAW | Kawi |
KAZ | Kazakh |
KHA | Khasi |
KHI | Khoisan (Other) |
KHO | Khotanese |
KIK | Kikuyu |
KIN | Kinyarwanda |
KIR | Kirghiz |
KOK | Konkani |
KON | Kongo |
KOR | Korean |
KPE | Kpelle |
KRO | Kru |
KRU | Kurukh |
KUA | Kuanyama |
KUR | Kurdish |
KUS | Kusaie |
KUT | Kutenai |
LAD | Ladino |
LAH | Lahnda |
LAM | Lamba |
LAN | Langue d'oc (post-1500) |
LAO | Lao |
LAP | Lapp |
LAT | Latin |
LAV | Latvian |
LIN | Lingala |
LIT | Lithuanian |
LOL | Mongo |
LOZ | Lozi |
LUB | Luba-Katanga |
LUG | Ganda |
LUI | Luiseno |
LUN | Lunda |
LUO | Luo (Kenya and Tanzania) |
MAC | Macedonian |
MAD | Madurese |
MAG | Magahi |
MAH | Marshall |
MAI | Maithili |
MAK | Makasar |
MAL | Malayalam |
MAN | Mandingo |
MAO | Maori |
MAP | Austronesian (Other) |
MAR | Marathi |
MAS | Masai |
MAX | Manx |
MAY | Malay |
MEN | Mende |
MIC | Micmac |
MIN | Minangkabau |
MIS | Miscellaneous (Other) |
MKH | Mon-Khmer (Other) |
MLA | Malagasy |
MLT | Maltese |
MNI | Manipuri |
MNO | Manobo languages |
MOH | Mohawk |
MOL | Moldavian |
MON | Mongolian |
MOS | Mossi |
MUL | Multiple languages |
MUN | Munda (Other) |
MUS | Creek |
MWR | Marwari |
MYN | Mayan languages |
NAH | Aztec |
NAI | North American Indian (Other) |
NAV | Navajo |
NDE | Ndebele (Zimbabwe) |
NDO | Ndonga |
NEP | Nepali |
NEW | Newari |
NIC | Niger-Kordofanian (Other) |
NIU | Niuean |
NOR | Norwegian |
NSO | Northern Sotho |
NUB | Nubian languages |
NYA | Nyanja |
NYM | Nyamwezi |
NYN | Nyankole |
NYO | Nyoro |
NZI | Nzima |
OJI | Ojibwa |
ORI | Oriya |
OSA | Osage |
OSS | Ossetic |
OTA | Turkish, Ottoman |
OTO | Otomian languages |
PAA | Papuan-Australian (Other) |
PAG | Pangasinan |
PAL | Pahlavi |
PAM | Pampanga |
PAN | Panjabi |
PAP | Papiamento |
PAU | Palauan |
PEO | Old Persian (ca. 600-400 B.C.) |
PER | Persian |
PLI | Pali |
POL | Polish |
PON | Ponape |
POR | Portuguese |
PRA | Prakrit languages |
PRO | Provencal, Old (to 1500) |
PUS | Pushto |
QUE | Quechua |
RAJ | Rajasthani |
RAR | Rarotongan |
ROA | Romance (Other) |
ROH | Raeto-Romance |
ROM | Romany |
RUM | Romanian |
RUN | Rundi |
RUS | Russian |
SAD | Sandawe |
SAG | Sango |
SAI | South American Indian (Other) |
SAL | Salishan languages |
SAM | Samaritan Aramaic |
SAN | Sanskrit |
SAO | Samoan |
SCC | Serbo-Croatian (Cyrillic) |
SCO | Scots |
SCR | Serbo-Croatian (Roman) |
SEL | Selkup |
SEM | Semitic (Other) |
SHN | Shan |
SHO | Shona |
SID | Sidamo |
SIO | Siouan languages |
SIT | Sino-Tibetan (Other) |
SLA | Slavic (Other) |
SLO | Slovak |
SLV | Slovenian |
SND | Sindhi |
SNH | Sinhalese |
SOM | Somali |
SON | Songhai |
SPA | Spanish |
SRR | Serer |
SSO | Sotho |
SUK | Sukuma |
SUN | Sundanese |
SUS | Susu |
SUX | Sumerian |
SWA | Swahili |
SWZ | Swazi |
SYR | Syriac |
TAG | Tagalog |
TAH | Tahitian |
TAJ | Tajik |
TAM | Tamil |
TAR | Tatar |
TEL | Telugu |
TEM | Timne |
TER | Tereno |
THA | Thai |
TIB | Tibetan |
TIG | Tigre |
TIR | Tigrinya |
TIV | Tivi |
TLI | Tlingit |
TOG | Tonga (Nyasa) |
TON | Tonga (Tonga Islands) |
TRU | Truk |
TSI | Tsimshian |
TSO | Tsonga |
TSW | Tswana |
TUK | Turkmen |
TUM | Tumbuka |
TUR | Turkish |
TUT | Altaic (Other) |
TWI | Twi |
UGA | Ugaritic |
UIG | Uighur |
UKR | Ukrainian |
UMB | Umbundu |
UND | Undetermined |
URD | Urdu |
UZB | Uzbek |
VAI | Vai |
VEN | Venda |
VIE | Vietnamese |
VOT | Votic |
WAK | Wakashan languages |
WAL | Walamo |
WAR | Waray |
WAS | Washo |
WEL | Welsh |
WEN | Sorbian languages |
WOL | Wolof |
XHO | Xhosa |
YAO | Yao |
YAP | Yap |
YID | Yiddish |
YOR | Yoruba |
ZAP | Zapotec |
ZEN | Zenaga |
ZUL | Zulu |
ZUN | Zuni |
Copyright © 1997 sunrise@rmit.edu.au http://www.srl.rmit.edu.au/sunrise/webDIY/metadata/index.shtml