Pajek datasets

The Edinburgh Associative Thesaurus

Dataset   eat

Description

  1. Directed network with 23219 vertices and 325624 arcs (564 loops); stimulus X is associated with response Y N times.
  2. Directed network with 23219 vertices and 325589 arcs (564 loops); response X is associated with stimulus Y N times. It seems that the SR network is incomplete and that it should be the inverse of the RS network.
  3. Directed network with 23219 vertices and 325593 arcs (564 loops); stimulus X is associated with response Y N times. Combined eatRS and eatSR, duplicated arcs (32) removed.


EAT response-stimulus (ZIP, 1321K)
EAT stimulus-response (ZIP, 1306K)
EAT stimulus-response NEW (ZIP, 1043K)
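
The archives above hold the networks in Pajek format. As a rough illustration only, the sketch below reads a tiny network in the usual Pajek .net layout (a `*Vertices` section of numbered, quoted labels followed by an `*Arcs` section of source, target and weight). The function name `parse_pajek` and the three-word sample are hypothetical; the exact layout of the EAT files has not been checked here and is assumed to follow this pattern.

```python
# Hypothetical sample in the common Pajek .net layout; not taken from the EAT files.
SAMPLE = """\
*Vertices 3
1 "CAT"
2 "DOG"
3 "BONE"
*Arcs
1 2 40
2 3 12
2 2 1
"""


def parse_pajek(text):
    """Parse a minimal Pajek .net text into (labels, arcs).

    labels maps vertex id -> label; arcs is a list of (source, target, weight).
    Only the *Vertices and *Arcs sections are handled in this sketch.
    """
    labels = {}
    arcs = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            # Section header such as "*Vertices 3" or "*Arcs"
            section = line.split()[0].lower()
            continue
        if section == "*vertices":
            parts = line.split(None, 1)
            rest = parts[1] if len(parts) > 1 else parts[0]
            if rest.startswith('"'):
                # Label is the text between the first pair of double quotes
                label = rest[1:rest.index('"', 1)]
            else:
                label = rest.split()[0]
            labels[int(parts[0])] = label
        elif section == "*arcs":
            parts = line.split()
            weight = float(parts[2]) if len(parts) > 2 else 1.0
            arcs.append((int(parts[0]), int(parts[1]), weight))
    return labels, arcs


labels, arcs = parse_pajek(SAMPLE)
loops = sum(1 for u, v, _ in arcs if u == v)
print(len(labels), "vertices,", len(arcs), "arcs,", loops, "loops")
```

Counting vertices, arcs and loops this way gives a quick check of a downloaded file against the figures in the description.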


The Edinburgh Associative Thesaurus (EAT) is a set of word association norms showing the counts of word association as collected from subjects. This is not a developed semantic network such as WordNet (3), but empirical association data.

The traditional way to collect word association norms is to show or say a word to several people and ask them to say the word which first comes to their minds upon receiving the stimulus. The link established between the stimulus and the response is not semantically labelled (e.g. as synonym, antonym or by a case relation) and can only be regarded as an association.

The Edinburgh association norms were collected by growing the network from a nucleus set of words. Responses were collected to words in this nucleus set, then these responses were used to obtain further responses, and so on. In fact the cycle was repeated about three times since by then the number of different responses was so large that they could not be re-used as stimuli. Data collection stopped when 8400 stimulus words had been used. Each stimulus word was presented to 100 different subjects, each of whom received 100 words. This gave rise to a total of 55732 nodes in the Thesaurus network.

The subjects were mostly undergraduates from a wide variety of British universities. The age range of the subjects was from 17 to 22 with a mode of 19. The sex distribution was 64 per cent male and 36 per cent female. The data was collected between June 1968 and May 1971.

The database consists of two files: the SR (stimulus-response) file and the RS (response-stimulus) file. Where words have been truncated to 19 characters to save space, the per cent character (%) has been placed as the 20th.
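
The truncation convention can be sketched as follows. The helper names are hypothetical, and the rule that every word longer than 19 characters is cut is an assumption; the text only states how a truncated word is marked.

```python
KEEP = 19      # characters kept from a truncated word (from the text)
MARKER = "%"   # placed as the 20th character (from the text)


def truncate(word, keep=KEEP):
    """Store a word under the convention described above.

    Assumption: any word longer than `keep` characters is truncated.
    """
    if len(word) <= keep:
        return word
    return word[:keep] + MARKER


def is_truncated(label, keep=KEEP):
    """True if a stored label carries the truncation marker in position keep + 1."""
    return len(label) == keep + 1 and label.endswith(MARKER)


def visible_prefix(label, keep=KEEP):
    """Return the label without the marker, if it was truncated."""
    return label[:keep] if is_truncated(label, keep) else label


stored = truncate("A" * 25)
print(stored, is_truncated(stored), visible_prefix(stored))
```

When matching EAT labels against another word list, truncated labels can only be compared on their first 19 characters.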

The EAT here is that included in the MRC Psycholinguistic Database (4), for use with the other measures available there.

EAT Data Collection Procedure (1)

Stimulus words

Since the objective was to obtain a reasonably large complete mapping of the associative network for a large set of words, a systematic procedure of 'growing' the network from a small nucleus was followed. At first responses were obtained from this nucleus set, then these responses were used as stimuli to obtain further responses, and so on. In fact, this cycle was repeated about three times, since by then the number of different responses was so large that they could not all be re-used as stimuli.

The nucleus set was derived from (a) the 200 stimuli used in the Palermo and Jenkins (1964) norms, (b) the 1,000 most frequent words of the Thorndike and Lorge (1944) word frequency count, and (c) the basic English vocabulary of Ogden (1954).

Data collection was stopped when 8,400 stimulus words had been used. Only a minimal amount of selection of stimuli was applied in each cycle of the data collection. Effectively all responses which were English words or meaningful verbal units were included, including some phrasal forms and numerals. The data cover a wide range of grammatical form classes and inflexional forms.


Each stimulus word was presented to 100 different subjects. Each subject received a computer-printed sheet with 100 stimuli in randomised arrangement (to minimize priming effects). The total contribution of each subject was thus 100 responses. The verbal environment of each word for each subject was different. The instructions asked the subject to write down against each stimulus the first word it made him think of, working as quickly as possible. The total time spent on this task was measured, and most subjects completed the sheet in five to ten minutes.

Most of the data was collected in a classroom setting under supervision. Sheets which had more than 25 percent blank responses were rejected and fresh data was collected.
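
The bookkeeping of the procedure above can be sketched in a few lines. The function names are hypothetical, treating an empty or whitespace-only entry as a blank response is an assumption, and the sheet count assumes every sheet is full and each subject completes exactly one sheet.

```python
def sheets_needed(n_stimuli, subjects_per_stimulus=100, stimuli_per_sheet=100):
    """Number of full sheets needed so that every stimulus is seen by the
    required number of subjects (assumes every sheet is completely filled)."""
    presentations = n_stimuli * subjects_per_stimulus
    return presentations // stimuli_per_sheet


def accept_sheet(responses, max_blank_percent=25):
    """Apply the rejection rule: a sheet with MORE than 25 percent blank
    responses is rejected. Blank means empty or whitespace-only (assumption)."""
    blanks = sum(1 for r in responses if not r.strip())
    # Integer comparison avoids floating-point rounding at the threshold
    return blanks * 100 <= max_blank_percent * len(responses)


# 8400 stimuli x 100 subjects = 840000 presentations, i.e. 8400 sheets of 100
print(sheets_needed(8400))

sheet_ok = ["word"] * 75 + [""] * 25    # exactly 25 percent blank: kept
sheet_bad = ["word"] * 74 + [""] * 26   # more than 25 percent blank: rejected
print(accept_sheet(sheet_ok), accept_sheet(sheet_bad))
```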

New version

The network SR should be equal to the transposed (mirror) version t(RS) of RS. This is not true. There are some differences:
   SR - t(RS):
     999.BELLOW       1

   t(RS) - SR:
     30.=*=          17
     ULCER.=*=        1
     THIRTY.=*=       1
     PERIOD.=*=       1
There were also 32 multiple lines. Since the weights on the parallel arcs were the same we treated them as duplicates and preserved only a single arc. The 'corrected' version is saved in
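
The comparison and clean-up described in this section can be sketched on toy data. This is a minimal illustration, not the procedure actually used to build the corrected file: the arc lists, weights and function names are hypothetical, and arcs are written as (source, target, weight) tuples.

```python
def transpose(arcs):
    """Reverse the direction of every arc, keeping its weight."""
    return [(v, u, w) for (u, v, w) in arcs]


def dedupe(arcs):
    """Drop exact repeats (same endpoints and same weight), keeping order."""
    seen = set()
    unique = []
    for arc in arcs:
        if arc not in seen:
            seen.add(arc)
            unique.append(arc)
    return unique


def compare_and_merge(sr, rs):
    """Return (SR - t(RS), t(RS) - SR, merged network without duplicates)."""
    t_rs = transpose(rs)
    only_sr = sorted(set(sr) - set(t_rs))
    only_trs = sorted(set(t_rs) - set(sr))
    merged = dedupe(sr + t_rs)
    return only_sr, only_trs, merged


# Hypothetical toy networks: RS holds one duplicated arc and one arc missing from SR
SR = [("CAT", "DOG", 40), ("DOG", "BONE", 12), ("CAT", "MOUSE", 7)]
RS = [("DOG", "CAT", 40), ("BONE", "DOG", 12), ("BONE", "DOG", 12), ("MILK", "CAT", 3)]

only_sr, only_trs, merged = compare_and_merge(SR, RS)
print("SR - t(RS):", only_sr)
print("t(RS) - SR:", only_trs)
print("merged arcs:", len(merged))
```

The arc counts quoted on this page fit the same bookkeeping if the 325624-arc network is RS and the 325589-arc network is SR (an inference from the order of the download list, not something the page states): 325589 + 4 = 325593 and 325624 - 32 + 1 = 325593.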


  1. Original EAT: George Kiss, Christine Armstrong, Robert Milroy and J.R.I. Piper (collected between June 1968 and May 1971).
  2. MRC Psycholinguistic Database Version modified by: Max Coltheart, S. James, J. Ramshaw, B.M. Philip, B. Reid, J. Benyon-Tinker and E. Doctor; made available by: Philip Quinlan.
  3. The present version was re-structured and documented by Michael Wilson at the Rutherford Appleton Laboratory in 1988 (2).
  4. Transformed into Pajek format: V. Batagelj, 31. July 2003.
  5. Combined RS and SR versions, removed duplicates: V. Batagelj, 12. August 2013.


  1. Kiss, G.R., Armstrong, C., Milroy, R., and Piper, J. (1973) An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W. and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: University Press.
  2. The present version of The Edinburgh Associative Thesaurus (ZIP, 2.7M)
  3. WordNet
  4. MRC Psycholinguistic Database
  5. Coltheart, M. (1981) The MRC Psycholinguistic Database. Quarterly Journal of Experimental Psychology, 33A, 497-505.
  6. Download MRC Psycholinguistic Database 2

31. July 2003