Psycho/Neurolinguistic Databases & Resources
Welcome language enthusiasts! This ‘database of databases’ is maintained by Jamie Reilly whenever the odd PubMed alert catches his eye. Please email if you encounter broken links or have content suggestions. We hope you find what you’re looking for.
Affective and Social Cognition Norms
affective ratings for >14k English words
Norms of valence, arousal, and dominance for 13,915 English lemmas from Warriner et al (2013).
affectvec
Vector-based norms for over 70,000 English lemmas across over 200 fine-grained affective dimensions, really down to the nittiest of the nitty gritty. These are interesting because they appear to have been generated using an embedding approach but are normalized to something that looks like a -1 to 1 Pearson correlation for each particular dimension, making them more feature-based and useful for single words.
bilingual valence and arousal ratings for l1-L2
This very cool database lists valence and arousal ratings from bilingual adults evaluating English words (i.e., English as L2). Link to the OSF here to access the data from Imbault and colleagues (2020), How are words felt in a second language: Norms for 2,628 English words for valence and arousal by L2 speakers.
grievance dictionary: language use in the context of grievance-fueled violence threat
This is a cool resource examining how language differs in the context of threat. van der Vegt and colleagues (2021) report norms for word usage commonly used in automated identification of threat in social media posts, etc. Click on the link above to access the data via the OSF.
NRC Word-Emotion Association Lexicon (emo-lex)
Emo-lex is a terrific set of crowdsourced word norms for many English words characterized across eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). This work was spearheaded by computational linguist, Dr. Saif Mohammad, at the National Research Council Canada.
social norms for 8388 english words
Diveica and colleagues have collected a terrific set of norms along with an inclusive definition of socialness (no easy feat!). Link to the data above on the OSF. To view the preprint, see Diveica, Pexman, & Binney (2021) Quantifying Social Semantics: an Inclusive Definition of Socialness and Ratings for 8,388 English Words. PsyArXiv.
Valence, Arousal, Dominance Lexicon (NRC-VAD)
Check out this terrific resource from Saif Mohammad: 20,000 English words rated on valence, arousal, and dominance using Best-Worst scaling. Plus, these words have translations to over 100 languages. Wow! The empirical paper describing the methodology is: Mohammad, S (2018). Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.
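For anyone curious how Best-Worst scaling turns judgments into a number, here is a minimal sketch of the usual counting rule: each item's score is the share of its appearances in which it was picked as best minus the share in which it was picked as worst. The trials and the function name `best_worst_scores` are invented for illustration. This is not the NRC pipeline, and the published lexicon may rescale its scores differently.

```python
from collections import Counter


def best_worst_scores(trials):
    """Score items from best-worst trials.

    `trials` is an iterable of (items, best, worst) tuples, where `items`
    is the set of words shown together and `best`/`worst` are the picks.
    Returns a dict mapping each item to (best picks - worst picks) / appearances,
    which falls between -1 and 1.
    """
    seen, best, worst = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in trials:
        seen.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical valence trials: raters see four words and pick the most
# and least positive one.
trials = [
    (("joy", "table", "grief", "picnic"), "joy", "grief"),
    (("joy", "table", "grief", "murder"), "joy", "murder"),
    (("picnic", "table", "grief", "murder"), "picnic", "murder"),
]

scores = best_worst_scores(trials)
for word, score in sorted(scores.items(), key=lambda pair: pair[1]):
    print(f"{word:8s} {score:+.2f}")
```

The appeal of this design is that raters only ever make comparative choices, which sidesteps the problem of different people using a rating scale differently.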
valence norms COVID19 effects of age & pandemic
Very cool work showing resilience among older US and UK adults to the effects of pandemic on ratings of positivity for thousands of English words. Link to the data above. Read the paper from Kyröläinen AJ, Luke J, Libben G, Kuperman V. Valence norms for 3,600 English words collected during the COVID-19 pandemic: Effects of age and the pandemic. Behav Res Methods. 2021 Dec 16:1-12. doi: 10.3758/s13428-021-01740-0
Age of Acquisition & Early Childhood Language
AOa English words for Spanish L2 speakers
Thanks very much to my new Twitter friend, Dr. Carlos Romero-Rivas for pointing out this very interesting database and article on subjective AOA ratings for English words by Spanish-English bilinguals. Come for the data but read the article! It’s pretty damn fascinating.
aoa ratings >50k english words
The Kuperman et al. norms represent an expanded list from Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 1-13. To access the norms visit here.
children’s picture book lexical database
We can’t just apply lexical norms (e.g., frequency, imageability) gleaned from adults to understand the linguistic world of children. That’s what’s so awesome about this work by Green and colleagues (2023 - click link above). The authors report norms for >25k words, including bigrams and multiword utterances. AWESOME stuff for you developmentalists out there. CLICK HERE for the data.
American Sign Language
asl-lex
This visually stunning and beautifully crafted database is from Professors Naomi Caselli, Zed Sevcikova Sehyr, Ariel Cohen-Goldberg, & Karen Emmorey. ASL-LEX provides lexical and phonological properties for about 1,000 signs of American Sign Language, including iconicity, frequency, and many other variables.
asl signbank
Wow! Here’s a huge dictionary of ASL signs linked with ID glosses. Signbank is linkable to ELAN and is part of the SLAAASh (“Sign Language Acquisition, Annotation, Archiving and Sharing”) project (link here to read about it) through UConn and Gallaudet Universities.
Aphasia
AphasiaBank
One of the many TalkBank repositories from the great Brian MacWhinney. Link to the homepage above.
Arabic
aralex: a lexical database for modern standard Arabic
This is a terrific resource for psycholinguistic investigations of Arabic reported by Boudelaa, S. & Marslen-Wilson, W.D. (2010) in Behavior Research Methods. Click HERE for the interface and above for the BRM article.
LexArabic: Receptive vocabulary test for estimating Arabic proficiency
Hats off to Dr. Alaa Alzahrani for creating this open-source (yes!) resource for assessing L2 Arabic proficiency. Click above to link to the article in Behavior Research Methods. Super useful!
Bilingualism & Multilingualism
iris digital repository of materials for research into second languages
We should all be conducting language research with an eye toward multilingualism and cross-language generalization. Here is a set of resources for people on the forefront of these efforts. Thanks to Cylcia Bolibaugh for linking me to this resource. Read one of the origin papers here. Click the link above to view the IRIS portal.
multilingual eye-movement corpus (MECO)
Wow oh wow. Led by Victor Kuperman and Noam Siegelman, this data repository reflects eye movement data from reading studies in native readers of 13 languages. Access the data by linking above.
Blackfoot
Blackfoot Words: a database of Blackfoot lexical forms
This awesome work from Weber et al (2023) includes (in the authors’ own words)…”structure and creation of Blackfoot Words, a new relational database of lexical forms (inflected words, stems, and morphemes) in Blackfoot (Algonquian; ISO 639-3: bla). To date, we have digitized 63,493 individual lexical forms from 30 sources, representing all four major dialects, and spanning the years 1743-2017. Version 1.1 of the database includes lexical forms from nine of these sources.”
Chinese
Norms for 1286 colored pictures in Cantonese
Zhong et al. (2024) report this picture norming study using stimuli from Multipic and the 2005 classic McRae norms applied to Cantonese. Link to the OSF repository to grab the norms directly!
chinese: ANCW: Affective norms for 4030 Chinese words
Link above to read the article by Ying and colleagues (2023, in press) in BRM. These words have arousal, concreteness, valence, and dominance norms. Link to the supplemental material from the article to access the stimuli (spreadsheet form).
chinese valence & arousal ratings >11k words
Read the article by Xu and colleagues (2021), and access the word ratings (valence and arousal) by clicking on the link above.
chinese lexicon project II: >25k lex decision, naming traditional Chinese two-character words
This terrific resource by Tse et al (2022) is one of a number of recent works that are allowing researchers to make great strides in understanding lexical access and word recognition in languages other than English. Awesome work!
chinese and english six semantic dimension database: a large database of semantic ratings and its computational extension
This awesome resource from Dr. Shaonan Wang at NYU and Dr. Nan Lin at the Institute of Psychology, Chinese Academy of Sciences, includes ratings of 17,940 commonly used Chinese words (and phrases) and a computational extension consisting of semantic ratings of 1,427,992 Chinese words and 1,515,633 English words on six major semantic dimensions: vision, motor, socialness, emotion, time, and space.
chinese imageability ratings for >10k 2-character words
This work by Su and colleagues (2022) reflects imageability ratings (i.e., rate the extent to which this word conjures a mental image) for over 10,000 words and examines the relationship between imageability and other lexical variables. Click <here> for just the data!
chinese verb semantic features
Verbs are really difficult to characterize. Verbs are less imageable than nouns and have argument structures. Verb path/manner distinctions differ across natural languages, but much of what we know about verbs has been informed by English. Deng and colleagues have produced a valuable resource for analyzing semantic features of verbs in Chinese. Click above to visit their paper — data are here.
Concreteness, Imageability, Sensorimotor norms
abstract conceptual feature ratings English nouns
These reflect MTurk ratings (N>350 people) on 15 different cognitive dimensions for 750 abstract and concrete English nouns, as described in our semantic space approach in Frontiers in Human Neuroscience (see Troche et al., 2014; Crutch et al., 2013).
concreteness ratings for 40k English Lemmas
Here's a mammoth set of word concreteness ratings from the great Professor Marc Brysbaert and colleagues. To retrieve these word concreteness norms, click here.
concreteness ratings for 62k English multiword expressions
They’re at it again! Click on the link above for the preprint of Muraki et al (2022) Concreteness ratings for 62 thousand English multiword expressions. If you just want to get your grubby hands on the data, link here.
lancaster sensorimotor norms
Here’s a giant set of effector-specific norms for many English words from Lynott and colleagues (2019). These people are the absolute shit.
Corpora: Language Samples
candor corpus >1 TB of multimodal corpus of human speech
You want over 850 hours of conversations transcribed and segmented down to the millisecond? Well… say no more. Thanks to Reece and colleagues for making this wonderful resource publicly available in the very best spirit of science. Click to link to the preprint above.
corpus of contemporary american english (coca)
The Corpus of Contemporary American English (COCA) is a very large corpus of English for you text miners and NLP folk. There’s a fee, but it’s not too exorbitant.
concretext norms: concreteness ratings for English and Italian words in context
Some pretty cool norms here reflecting concreteness ratings (how strongly can you perceive this word through the senses?) for English and Italian words in sentence contexts. Click on the link above for the PLoS One article from Montefinese and colleagues. Click <here> to access the data directly.
global vectors for word representation (glove)
From Pennington et al. (2014) out of the Stanford NLP group: “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.”
88-million-word language of conspiracy corpus (loco)
I’m almost afraid to look at this, but it looks like an amazing corpus for mining the weird language of conspiracy theorists. Be sure to check out the paper from Miani et al (2021) in Behavior Research Methods.
spotify spoken language podcast corpus (spotify corpus)
47,000 hours of podcast transcriptions are downloadable here. Wow!
talkbank
AphasiaBank, DementiaBank, BilingBank: this terrific resource by Professor Brian MacWhinney and his many colleagues and collaborators is one of the best resources around for analyzing natural language samples (stories, dyads, etc.) in numerous clinical and non-clinical populations. Thanks to Professor MacWhinney for all his hard work on collecting these language samples and building the complex data structures to make them public.
Croatian
croatian >3k words on 5 emotions (Crowd-5e)
Thanks to Coso and colleagues for producing this database of affective norms for a chunky set of Croatian words. It’s pretty fascinating to think of how much cross-linguistic variability there is in lexical affect. With the Crowd-5e database, it should be possible to contrast English translation equivalents to get at this question. That’s a great idea for a master’s thesis (pssst… to you linguists out there).
Data Visualization & Graphic Design Resources
brain illustration: shading the freesurfer brain
I often find myself making illustrations of brains and highlighting particular regions for talks. Here's a document on how to do this in Photoshop using the Freesurfer brain rendering as a base.
dataviz
Technically this isn’t a psycholinguistic database or a stimuli bank but whatta resource! It’s a beautiful website organized like a decision tree for different plot and figure options along with links to R code and galleries. Thanks to software engineer Yan Holtz and designer Conor Healy for this beautiful and useful resource.
google n-gram viewer
Plot the frequency of any word or combination of words (n-grams) across many texts from 1800 to the present using Google’s interactive ngram viewer. Visit the ngram site here.
data visualization: plots, plots, plots
A gallery of plots generated in R with associated code from me, good old Jamie Reilly. Don’t mock.
ten simple rules for designing graphical abstracts
This 2024 article by Jambor and Bornhäuser in PLoS Computational Biology lays out some terrific guidelines for how to produce an effective graphical abstract for distilling your complex mechanisms or processing pipelines into digestible chunks. I love it!
Dictionaries and Related Lexical Resources
hunspell
*Note to self add link, dummy
urban dictionary
Today’s featured entry is “back burner bitch” or Triple B. It’s a friend who is your last resort for hanging out with (but doesn’t know it). Urban dictionary has zillions of these entries. We’ve used the urban dictionary extensively in our work on taboo word usage.
Dutch
arousal, valence, happiness, anger, fear etc. for 24k Dutch words
Is there anything that Dr. Laura Speed is not capable of accomplishing? Here she goes again with the inevitable Marc Brysbaert on this terrific set of Dutch word norms. Read the article or skip right to the data. Click here to visit their OSF site.
bank of standardized stimuli (boss): dutch names for 1400 photographs
The title pretty much says it all. Visit this work by Decuyper et al. (2021) in the Journal of Cognition (click link above). To get your grubby hands on the data directly, click here.
semantic gender norms for 24k dutch words
Semantic gender is a pretty mindblowing phenomenon. I remember taking German in high school and wondering why Tisch (table) is der Tisch (masculine). Semantic gender isn’t really about that but more about priming: when you hear ‘der’ you are primed for only a subset of the lexicon relative to hearing ‘the’, which could be followed by just about anything. Read what Vankrunkelsven and colleagues have to say about how semantic gender facilitates our processing of word meaning. Oh yeah, norms too!
Embodied Cognition and Related Phenomena
affordance norms for 2825 concrete nouns
This paper takes me all the way back to reading JJ Gibson’s Ecological Approach to Visual Perception during my PhD. Those were the days. Chairs afford sitting because your butt fits into them nicely. You don’t need semantics for that. It’s affordances. Maxwell and colleagues (2024) report affordance norms for concrete nouns, no easy task! They also compared affordance norms with body interaction salience. I can’t give away the conclusion but it will SHOCK you.
calgary semantic decision project and embodied cognition ratings
Link above for category decision norms (concrete or abstract) and embodiment ratings from my Canadian idol, Dr. Penny Pexman and her co-authors Allison Heard, Ellen Lloyd, and Melvin Yap. Great people. Great data.
Estonian
Concreteness ratings for 36k Estonian words
Congratulations to Proos and Aigro (2023) on their article in Behavior Research Methods (link above) reporting concreteness values for almost 36k Estonian words as derived from over 2k Estonian native speakers. The authors also contrasted these human-generated norms with a larger set of concreteness values generated by machine learning reported by Aedmaa et al (2018). Humans and machines, although strongly positively correlated in their ratings (R=.70), diverged in some pretty substantial ways. CHECK IT OUT!
Eyetracking and Vision
The Visual Experience Dataset
Wow! This awesome new resource by Dr. Michelle Greene and colleagues represents >200 hours of integrated eye movements, odometry, and egocentric video. Link to the preprint above. Access the data directly here.
Finnish
LASTU: A psycholinguistic search tool for Finnish lexical stimuli
There’s a special place in my heart for Finland. I’ve visited the country five times and have some very dear friends there. It’s a beautiful place, and Finnish is an astounding language. Thanks to Itkonen and colleagues (2024) for producing this lexical database of Finnish. Link to the paper above or to view the database directly, click here. An added bonus is that the senior author is my friend, Minna Lehtonen.
French
body-object interaction norms for 3600 French nouns
Lalancette et al (2024) report these norms for all you embodied cognition freaks out there. Link to the data here!
Conceptual familiarity for 4k French nouns
Oui! Chedid et al (2019) report norms for conceptual familiarity of (wait for it) 4000 French nouns. This is some interesting work. There is a lot of controversy around disentangling lexical concepts from concepts. It’s nice to see more psycholinguistic norms from French coming down the pike.
morpholex-FR derivational morphology for almost 39k French words
Who doesn’t love French morphology? Actually, I am embarrassed to say that I don’t know anything about French morphology, but my ignorance shouldn’t stop you from caring about French morphology. Thanks to Mailhot and colleagues (2020) for publishing this very cool paper and associated set of norms. Click above to link to their article in Behavior Research Methods.
multi-lex: a database of multi-word frequencies for French and English
What options do you have if you are interested in the frequency of a bigram like ‘football helmet’? You might just consider grabbing individual frequencies for football and helmet and generating some frankenaverage. This terrible decision would lead you deep into the valley of darkness. What you need is a specific database of multi-word utterances (tokenized by bigrams, trigrams, etc.). Armando and colleagues (2023) have done this work for you. Link to their BRM paper above.
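To see why the frankenaverage goes wrong, here is a minimal sketch that counts unigrams and bigrams in a tiny made-up corpus. The text and the helper `ngram_counts` are invented for illustration and have nothing to do with how Multi-Lex itself was built. The point is only that a bigram's frequency has to be counted directly, because it cannot be recovered from the frequencies of its parts.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens (as tuples) in a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Toy corpus (hypothetical). Punctuation is kept as its own token.
text = (
    "the football helmet cracked . he kicked the football . "
    "she wore a bike helmet . the football helmet was new ."
)
tokens = text.split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

football = unigrams[("football",)]
helmet = unigrams[("helmet",)]
football_helmet = bigrams[("football", "helmet")]
franken = (football + helmet) / 2  # the tempting but wrong shortcut

print(f"football: {football}, helmet: {helmet}")
print(f"'football helmet' counted directly: {football_helmet}")
print(f"average of the two unigram counts: {franken}")
```

In this toy corpus both words occur three times, yet they occur together only twice, so the averaged figure overstates the bigram.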
Greek
GreekLex: a lexical database of Modern Greek
I have a confession. When I first saw this title I thought it said, ‘geeklex’ which would not have been even half as impressive as the actual work by Ktori and colleagues (2008) reporting lexical information for over 35k modern Greek words.
Hindi
shabd: a psycholinguistic database for Hindi
Verma and colleagues (2022) provide this resource based on a 1.2 billion word query of Hindi. Data include frequency counts and part-of-speech tags for a subset of words. Click on the link to access the article! Thanks to these authors for this terrific resource.
Iconicity
iconicity norms for >14k English words
This work led by the great Bodo Winter reports iconicity ratings for lots and lots of English words. This is a TERRIFIC resource for anyone interested in iconicity, sound symbolism, and related phenomena. Click <here> to access the data directly.
Italian
italian sensorimotor norms: perception & action strength >900 words
These norms by Repetto and colleagues (2022) BRM reflect perceptual and motor salience for over 900 words, adding to a growing list of modality norms. Click here to visit the data or the link in the title for the article description.
Megastudies (lex decision, naming)
chinese lexicon project II: >25k lex decision, naming traditional Chinese two-character words
This terrific resource by Tse et al (2022) is one of a number of recent works that are allowing researchers to make great strides in understanding lexical access and word recognition in languages other than English. Awesome work!
english lexicon project (lexical decision and speeded naming)
Here's another bread-and-butter psycholinguistic database from Professor David Balota at Washington University in Saint Louis. This monster has trial level naming and lexical decision data for zillions of English words.
Miscellaneous (the island of lost toys)
context availability 3k English words and associations with lexical processing
I love the old research by Schwanenflugel on context availability as an alternative to concreteness effects in lexical processing (e.g., fork conjures the context of a kitchen schema). Taylor and colleagues (J Cognition 2022) present norms for context availability for >3k English words. Click here for the data.
general knowledge norms
Did you know that Tasmanian Devils are bioluminescent? This wasn’t one of the general world knowledge questions Coane & Umanath (2021) assessed in their norms, but I like to think that one day this little factoid will become general knowledge.
glasgow psycholinguistic norms (imageability, valence, etc.)
Normative ratings for 5,553 English words on nine psycholinguistic dimensions: arousal, valence, dominance, concreteness, imageability, familiarity, age of acquisition, semantic size, and gender association reported by Scott et al 2018 in Behavior Research Methods.
idiom norms for english and german
Link to the English-German Database of Idiom Norms (DIN). These include a set of 300 idioms and associated norms collected by Sara D. Beck & Andrea Weber (2016) at the University of Tübingen.
kitchen and food sounds: normative ratings
Oh my misophonia is sending lightning bolts through me at the very thought of someone biting into a chicken wing. Check out this work by Prada and colleagues (2024) in Behavior Research Methods. I LOVE IT. Link directly to their OSF Repository for the stimuli here.
LexiCAL: A calculator for lexical variables
Chee and colleagues (2021) published the nuts and bolts of this tool for computing numerous lexical properties of any corpus you feed it. Check out their article in PLoS One, which includes Python scripts for deriving the norms. I wish I knew how to program in Python a bit better than I do now. I’m stuck in the tidyverse.
LexOPS: R-Package & User Interface for the Controlled Generation of Word Stimuli
One database to rule them all! Where was this when I was trying to match stimuli for my doctoral dissertation? A Shiny app interface, too? Get out of town! Link to the PsyArXiv preprint here.
mrc psycholinguistic database
Here's the queen mother of psycholinguistic databases from the MRC/CBU. Many of the word frequency and concreteness measures are dated at this point, but the filtering features and the range of variables (concreteness, familiarity, etc.) make this wonderful resource tough to beat.
oddity detection in real world scenes (ODDS database)
Click on the link above to access the OSF site and database of real world scenes by Hout and colleagues (2022). This is like a YUGE pile of real world Where’s Waldo photos where they don’t tell you what Waldo looks like. I love it!
prevalence ratings >60k words
Brysbaert and colleagues (2019) reported prevalence estimates for 61,800 words. ‘Prevalence’ is the relative proportion of people who know a particular word normalized using a probit transformation (see Marc Brysbaert’s webpage for a simple explanation). Prevalence can provide complementary information to word frequency and familiarity.
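If the probit step sounds mysterious, here is a minimal sketch of the transformation on its own: convert the proportion of people who know a word into a z-score using the inverse of the standard normal CDF. The counts are invented, the function name `prevalence` is mine, and clamping proportions of exactly 0 or 1 is an assumption I added so the function stays finite. The published estimates may come from a more elaborate model, so treat this as an illustration of the scale only.

```python
from statistics import NormalDist


def prevalence(n_known, n_total):
    """Probit-transform the proportion of raters who know a word.

    Proportions of exactly 0 or 1 would map to -inf/+inf, so they are
    nudged inward by half a response (an assumption made for this sketch).
    """
    proportion = n_known / n_total
    epsilon = 0.5 / n_total
    proportion = min(max(proportion, epsilon), 1 - epsilon)
    return NormalDist().inv_cdf(proportion)


# Hypothetical counts of raters who said they know each word.
examples = {
    "everyday word": (975, 1000),
    "coin-flip word": (500, 1000),
    "obscure word": (25, 1000),
}

for label, (known, total) in examples.items():
    print(f"{label:15s} {known / total:.3f} known -> {prevalence(known, total):+.2f}")
```

A word known by half of the raters lands at 0, and the scale stretches out differences near the ceiling, which is where most words in a typical vocabulary sit.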
scope: the south carolina metabase
Per Gao and colleagues… “The South CarOlina Psycholinguistic MEtabase (SCOPE) is a curated collection of psycholinguistic properties of words from major databases. It currently contains more than 200 variables and over 79,000 words and nonwords.” Read the preprint while it’s hot! Anything from Rutvik Desai’s group is off the hook. This metabase approach is the way to go….
word.norms database aggregation
Professor Erin Buchanan’s terrific resource from the Doom Lab pooling databases for word associations, frequency, etc. Link to the word norm database to specify ranges and generate your own output.
Morphology and Compounding
ladec: large database of english compounds
If you’re out taking your bulldog for a walk and want a catfish sandwich, look no further than this database of >8000 English compound words from Gagné, CL., Spalding, TL., & Schmidtke, D. (2019). LADEC: Large database of English compounds. Behavior Research Methods. Link to the data here. Link above to the article in BRM.
morpholex english
English morphology for 70k-ish words as reported by Sánchez-Gutiérrez et al (2017).
Narrative Stimuli, Stories
aesop’s fables
Check out this paper by Ward et al (2015) examining the effects of age, acoustic challenge, and verbal working memory on recall of narrative speech. Audio files matched on all sorts of shit, and Aesop too. What’s not to like?
fMRI open narratives
Holy moly! The best of open science is upon us. Here’s a massive set of functional imaging data on naturalistic speech comprehension in the scanner. Thanks to Professor Uri Hasson for maintaining this resource. To read the paper and related documentation, see Nastase et al (2019), OpenNeuro, ds002345. https://doi.org/10.18112/openneuro.ds002345.v1.1.3
stories! NYU-BU contextually controlled stories corpus
These spoken narratives reflect meticulous experimental control as reported by Lewis and colleagues. The stimuli consist of “16 high-quality recordings of 8 unique stories, spoken both by a female and a male actor. Each story consists of 128 sentences (~2000 words per story) organized around critical keywords, which have been matched along multiple linguistic dimensions.”
Nonword & Pseudoword Stimuli
ARC nonword Database
Working on a lexical decision task and need some weird nonword foils? Hmmm... but what if I need to specify some weird orthotactic constraints on my nonword stimuli? Never fear. The bulk of the work has been done for you with the ARC nonword database.
klingon pocket dictionary
I am embarrassed to admit this, but I hit the Klingon dictionary for nonwords for one of the lexical decision experiments in my doctoral dissertation. So what! Just live with it.
wuggy multilingual pseudoword generator
This very cool database generates pseudowords aligned with the phonotactic rules of a specific target language as outlined in the paper from Keuleers, E., & Brysbaert, M. (2010).
Norwegian
Norwegian words: A lexical database for clinicians and researchers
This resource from Lind and colleagues (2015) appeared in Clinical Linguistics & Phonetics. The database contains extensive phonological and morphological coding for 1600 or so Norwegian words.
Picture Stimuli & Naming Norms
age positive image gallery
There are so many things to like about this gallery of free images depicting a wide variety of older adults in a positive light! Click on the link above to access the photo gallery (no associated article or norming data).
flaticon
>10.5 million free icons if you need an icon. It’s not a scientific source, but don’t be such a snob. Did you know that Dr. Amy Vogel Eyny’s sister adds fake reviews to her Rate My Professor page? For real!
international picture naming database (IPNP)
Here's a great set of pictures of actions (n=275) and objects (n=520) normed across multiple languages from the International Picture Naming Project out of UCSD.
multi-language written picture naming dataset
This triumph by Torrance and colleagues (2017) involves trial level reaction time and naming agreement norms for the 260 colorized Snodgrass and Vanderwart pictures by Rossion and Pourtois (2004) for over 1200 participants across the following languages: (Bulgarian, Dutch, English, Finnish, French, German, Greek, Icelandic, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish). WOW!!!!
multipic
The Multilingual Picture (MultiPic) databank is the result of an international collaborative project intended to provide the scientific community with a set of publicly available 750 drawings from common concrete concepts created by the same author, standardized for name agreement and visual complexity in several languages. See Duñabeitia, J.A., Crepaldi, D., Meyer, A.S., New, B., Pliatsikas, C., Smolka, E., & Brysbaert, M. (in press). MultiPic: A standardized set of 750 drawings with norms for six European languages. Quarterly Journal of Experimental Psychology.
noun project
For one million icons and symbols, link here. There’s a subscription fee for use, but it won’t break your bank unless you live in Silicon Valley (burn).
pisces pictures with social context and emotional scenes
Click on the link above for some great scene stimuli by Teh and colleagues (2018) normed for emotional valence, intensity, and social engagement. Someone must have been visiting my family at Thanksgiving to get this dark...
proper noun and place name norms in younger and older adults
This work by Souza et al (2022) examined naming in younger and older adults for famous places and people. These norms are culturally adapted and include all the usual suspects in terms of familiarity, etc.
scidraw
A database of free pictures for scientific presentations and whatever else you like. These aren’t normed or anything, but they might be useful for someone, and God if they aren’t cute. Check out the rats boxing c/o Antonis Asiminas.
things database 1854 object concepts, 26k natural images
Click on the link above to access the PLoS One article from Hebart and colleagues (2019). If you can’t wait to get your grubby hands on the stimuli, then click here to visit the OSF site.
sun scene database (mit)
Need a picture of a beach or a kitchen -- or 15,000 other naturalistic scenes? Here's the database for you.
R Packages for Language Science (links to github pages)
ConversationAlign
Reads dyadic transcripts, yokes numeric values to each word, and computes indices of alignment across pairs of interlocutors. Install the package from github using devtools.
curser
Generates novel combinations of curse words and common nouns using algorithms described in Reilly J, Kelly A, Zuckerman B, *Twigg P, *Wells M, *Jobson K, & *Flurie M (2020) Building the perfect curse word: A psycholinguistic investigation of the form and meaning of taboo words. Psychonomic Bulletin & Review. 27(1).
semdistflow
Computes running bigram semantic distances for every pair of adjacent words in any length text you feed the program. Uses algorithms described in Reilly J, *Finley AM, Litovsky C, & Kenett Y (2023) Bigram semantic distance in continuous language narratives: Theory, method, applications. Journal of Experimental Psychology: General. 152(9), 2578-2590.
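For a feel of what a running bigram semantic distance looks like, here is a minimal sketch in Python that takes cosine distance between the vectors of each adjacent word pair. This is not the semdistflow implementation, which is an R package with its own norms and preprocessing. The three-dimensional vectors and the function names below are made up purely for illustration, and skipping words without a vector is my assumption.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norms


def running_bigram_distance(words, vectors):
    """Pair each word with the next one and return (word, next_word, distance).

    Words with no entry in `vectors` are dropped before pairing
    (an assumption made for this sketch).
    """
    known = [w for w in words if w in vectors]
    return [
        (first, second, cosine_distance(vectors[first], vectors[second]))
        for first, second in zip(known, known[1:])
    ]


# Hypothetical toy embeddings; real work would load trained word vectors.
toy_vectors = {
    "dog": (0.9, 0.1, 0.0),
    "cat": (0.8, 0.2, 0.0),
    "carburetor": (0.0, 0.1, 0.9),
}

distances = running_bigram_distance("dog cat carburetor".split(), toy_vectors)
for first, second, distance in distances:
    print(f"{first} -> {second}: {distance:.3f}")
```

With these toy vectors the jump from cat to carburetor comes out far larger than the step from dog to cat, which is the kind of semantic lurch a running distance measure is meant to expose.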
usapresidentialdebates
Two-party presidential debate transcripts from 1960 to 2020 with metadata on the candidates (e.g., party, party winner, age) and economic indicators (e.g., GDP). Package is optimized for use with its companion R package, ConversationAlign.
Reading, Spelling, Orthography
false fonts: a compendium of fonts
Here’s a really useful set of novel constructed orthographies from the FontStruct website. We are planning on using one of these fonts as a lower level visual baseline for English orthography (similar luminance and complexity, minus the semantics) for an EEG study we are soon launching.
orthographic consistency norms for 37k English words
With all the Yachts, Colonels, and Wednesdays in the English language, it’s a wonder anyone ever learns to read. Chee and colleagues (2020) report norms for feedforward (spelling-to-sound) and feedback (sound-to-spelling) consistency among 37,677 English words. This is a terrific resource for anyone investigating reading, writing, and disorders thereof.
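If feedforward versus feedback consistency feels abstract, here is a minimal Python sketch of one common way to think about it: count a word's "friends" (words that share both its spelling pattern and its sound) against everything that shares the spelling (feedforward) or the sound (feedback). This is a toy type-based version at the rime level only. The mini-lexicon, the ASCII pronunciation codes, and the function name are invented for illustration, and the published norms may well be computed differently.

```python
from collections import Counter

# Toy lexicon invented for illustration: word -> (spelling of rime, sound of rime).
# The sound codes are made-up ASCII stand-ins, not a real phonetic alphabet.
TOY_LEXICON = {
    "mint": ("int", "Int"),
    "hint": ("int", "Int"),
    "lint": ("int", "Int"),
    "pint": ("int", "aInt"),
    "deep": ("eep", "ip"),
    "keep": ("eep", "ip"),
    "heap": ("eap", "ip"),
}


def rime_consistency(lexicon):
    """Return feedforward and feedback consistency for each word.

    feedforward = friends / words sharing the spelling (spelling-to-sound)
    feedback    = friends / words sharing the sound (sound-to-spelling)
    where friends share both the spelling and the sound.
    """
    spelling_counts = Counter(spelling for spelling, _ in lexicon.values())
    sound_counts = Counter(sound for _, sound in lexicon.values())
    pair_counts = Counter(lexicon.values())

    scores = {}
    for word, (spelling, sound) in lexicon.items():
        friends = pair_counts[(spelling, sound)]
        scores[word] = {
            "feedforward": friends / spelling_counts[spelling],
            "feedback": friends / sound_counts[sound],
        }
    return scores


if __name__ == "__main__":
    for word, vals in rime_consistency(TOY_LEXICON).items():
        print(word, vals)
```

In this toy set, "pint" is the odd one out going from spelling to sound, while "heap" is perfectly predictable to read but has competitors when you go from sound back to spelling.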
spelling-to-pronunciation norms for 20k English words
English is such a tangled web. It’s a wonder any of us ever learns to read. Check out this work by Edwards and colleagues (2023) in BRM. Visit the authors’ github repo to steal the data for the low cost of free!
text readability via CLEAR: a corpus of normed reading passages
Visit Crossley et al (2022), “A large-scale corpus for assessing text readability,” to read about this dataset. Click on the link above to get your hands on the reading passages and all of the beautiful analytics on the complexity of each passage.
Reilly Lab: Demos and R Tutorials
I am a big proponent of creating open source tools and helping people out with their research as much as I can. My lab also has a YouTube channel with some tutorials and whatnot. Here are some tutorials we have developed that I’ve found helpful over the years.
Semantic Category Norms, Features, & Networks
decompositional semantics initiative
I confess. I study semantic memory but have never taken a formal semantics class. It’s pretty obvious if you’ve read anything I’ve written that most of the time I have no idea what I’m talking about, but you can bet that these people do. Visit the site to find out all about the many tools these computational linguists and computer scientists have created for elucidating semantic composition.
Feats: A database of semantic features for early produced noun concepts
Congratulations to Borovsky and colleagues (2023) on their recent publication in BRM! I’ll let the authors explain these norms in their own words, “Feats—a tool that was designed to make headway on these challenges by providing a database, the Language Learning and Meaning Acquisition (LLaMA) lab Noun Norms that extends a widely used set of feature norms McRae et al. Behavior Research Methods 37, 547–559, (2005) to include full coverage of noun concepts on a commonly used early vocabulary assessment” — Ken McRae is the big daddy of all semantic feature norms, so this dataset is bound to be terrific.
a large database of semantic norms and their computational extension
How cool is this article from Wang and colleagues (2023) appearing in Nature Scientific Data? Using embeddings to extrapolate semantic ratings is such a cool idea. From the authors: “Six Semantic Dimension Database (SSDD), which contains subjective ratings for 17,940 commonly used Chinese words on six major semantic dimensions: vision, motor, socialness, emotion, time, and space. Furthermore, using computational models to learn the mapping relations between subjective ratings and word embeddings, we include the estimated semantic ratings for 1,427,992 Chinese and 1,515,633 English words in the SSDD”
semantic congruency norms: object-scene matching the ObScene database
I honestly can’t think of a database with a better name than this one — Ob- for object and scene for scene makes obscene. Love this. Andrade and colleagues report 898 object-scene pairs (e.g., suitcase-airport). I can think of so many uses for these stimuli! Visit their OSF site to access the data directly.
semantic category production norms for 117 concrete and abstract categories
Thanks to Banks and Connell (2022) for publishing this massive dataset of semantic category norms. People produced as many exemplars as they could in 60 seconds for 117 categories. It’s like a GIANT verbal fluency task. Read the paper above. Link to the OSF HERE for the data.
semantic distance web interface (snaut norms)
This simple web-based interface allows the user to derive distances between words or documents based on a continuous bag of words (CBOW) embedding model trained on subtitles for English and Dutch. For methods, see Mandera, P., Keuleers, E., & Brysbaert, M. (in press). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language.
semantic feature coding for concrete and abstract concepts in FMRI
Wang and colleagues (2022) present feature norms for 600+ concepts by 11 participants undergoing fMRI. The concepts are abstract and concrete, and the features were generated offline using crowdsourcing. Click on the link above to access the data on Open Neuro. Very cool!
semantic feature production norms for 4436 English concepts
We have the great Erin Buchanan and a gang of roving misfits to thank for this extended database of feature production norms for 4436 concepts. Here’s a link to the paper in Behavior Research Methods. Click on the title above to link to the data on the OSF.
semantic feature norms for manipulable objects
Thanks to Valerio et al (2023) for publishing these norms. Read the article in Cognitive Neuropsychology, all you embodied cognition freaks. 130 participants, 80 objects. Link to the database on the OSF here!
synonymy and semantic feature generation in younger and older adults
Here’s some great data from two fantastic people. Read the 2022 paper by Wei Wu and Paul Hoffman in Royal Society Open Science. Link to their data on the OSF by clicking above. These scholars report synonymy judgments and feature matching provided by 200 older and younger adults.
wordnet
The shadowy company that created the Terminator? Or was that Skynet? No matter… Wordnet has been around for a long time. It’s one of those bread-and-butter sources for constructing semantic networks. Wordnet plots distances between many English words using something called synsets. I think they’re synonym-like, but what do I know? Link to Wordnet above to find out.
Sentence Processing & Syntax
cloze probability, predictability, and alignment w/ EEG for 205 English sentences
“Cat on a hot tin ______.” This feels like a USA Today crossword puzzle clue, but it’s also a good demonstration of cloze probability. ‘Roof’ in this context is a highly constrained candidate. Violations of cloze probability expectations (e.g., cat on a hot tin banana) are a longtime love of EEG language researchers. These sentences from Varga et al. are very carefully aligned with respect to cloze probability and predictability.
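Cloze probability is easy to compute once you have completions in hand: it is the proportion of respondents who supplied a given word for the blank. Here is a minimal Python sketch under that definition. The responses and the function name are invented for illustration, and this is not a description of how Varga et al. processed their data.

```python
from collections import Counter


def cloze_probabilities(responses):
    """Return {word: proportion of respondents who gave that word}.

    Responses are lowercased and stripped of surrounding whitespace, and
    empty responses are dropped (both are assumptions of this sketch).
    """
    cleaned = [r.strip().lower() for r in responses if r.strip()]
    counts = Counter(cleaned)
    total = len(cleaned)
    return {word: n / total for word, n in counts.items()}


if __name__ == "__main__":
    # Invented completions for "Cat on a hot tin ______".
    toy_responses = ["roof"] * 8 + ["Roof ", "stove"]
    probs = cloze_probabilities(toy_responses)
    print(probs)
    # A word nobody produced has a cloze probability of zero.
    print(probs.get("banana", 0.0))
```

A high value for one completion marks a strongly constraining sentence frame, and a word that nobody produces is exactly the kind of expectation violation EEG researchers like to spring on people.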
sentence completion norms
Need some sentence completion norms? Don’t we all! Well, first read this paper by Peelle et al. (2020) if you can stomach it. Then link to the stimuli by clicking above.
Spanish
SPALEX: A Spanish Lexical Decision Database From a Massive Online Data Collection
Pretty amazing work by Aguasvivas et al (2018) appearing in Frontiers in Psychology representing a welcome megastudy of word recognition latencies as judged by lexical decision (i.e., Is this a word? Y/N) for Spanish.
spanish positive emotion norms
I love this work by Hinojosa and colleagues in BRM (2023, in press). The authors report norms for 9000 Spanish words across 7 positive emotions. This is an awesome resource for you affect-heads out there. LINK HERE to access the data.
spanish verb naming norms for psycholinguistic and motor content variables
Link above to the data reported by San Miguel-Abella et al (2021) for verb naming. These norms include over 4000 Spanish verbs. This is an awesome resource.
Speech Perception and Speech Reading (aka lip reading)
Mouth and Facial Informativeness Norms for 2,276 English Words
Face/mouth visual articulatory norms for thousands of English words. This terrific work by Krason and colleagues in Behavior Research Methods includes norms for visual articulatory salience of spoken words derived from mturkers viewing silent videos of people speaking. Click here to zap right over to the data.
Symbol Processing & Symbolic Cognition
symcog: an open source toolkit for assessing human symbolic cognition
Click on this awesome resource by Flurie et al (2022), which includes a set of Heider-Simmel-like animations depicting abstract concepts such as heaviness. There’s also an extensive list of concepts without words. I should know - I’m a co-author on this!
Tabooness
tabooness language across the globe: A multilab study
I am a giant fan of Marco Marelli, Melvin Yap, Marc Brysbaert, and many of the other authors on this powerpacked effort to create a cross-linguistic database of tabooness ratings. 13 languages, 17 countries, >40 native speakers at each site rating tabooness and many other features of the lexical stimuli. Read the BRM article above. Grab the data HERE.
tabooness norms for american english
Click on the link above to access norms for word length and other formal (e.g., phonological) variables for a set of taboo words. WARNING: includes hate speech. This database reflects 1205 English high frequency words coded across 22 psycholinguistic variables. Click HERE to download data on combinatorial cursing (i.e., what makes a good combination of a curse word and a common noun in American English). We reported these data in Reilly et al (2020).
Turkish
(taco) A Turkish database for abstract concepts
This awesome work by Conca and colleagues (2024) in Behavior Research Methods, in their own words: “We included 503 words-78 concrete (fruits, animals, tools) and 425 abstract (emotions, social, mental states, theoretical, quantity, space, political)-rated by 134 Turkish speakers for familiarity, imageability, age of acquisition, valence, arousal, quantity, space, theoretical, social, mental state, and political dimensions.” Click on the link to take you to the article or HERE to take you right to the norms.
Welsh
SUBTLEX-CY: A new word frequency database for Welsh
So cool! Word frequency norms for Welsh as reported by van Heuven et al (2023) based on a >30 million word corpus of Welsh subtitles. Here’s the weird thing about generating frequency norms from news subtitles: you tend to radically underestimate the prevalence of cursing in daily life since many countries impose restrictions on what you can/can’t say in media.
Word Associations
small world of words English word association norms for 12k words
De Deyne and colleagues have been collecting word association norms a la Doug Nelson’s classic USF norms for the past few years. They now have word associations for 12,000 words. That’s yuge!
university of south florida (USF) word association norms
When I say ‘dog’, what’s the first word that comes to mind? That’s word association, and it tells us a lot about language and semantic memory. The USF database is the OG of word association norms from Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms.
words (Dutch and English) with corresponding pictures matched on visual and semantic similarity
This is a terrific resource for those of us interested in interactions between language, vision, and semantic memory. Thanks to Falk Huettig for sending these our way! The stimuli and matching procedures are described by de Groot and colleagues (2015). Think picture tetrads varied in semantic similarity.
Word Frequency (words & multiword utterances)
subtlex US
Everybody uses these word frequency norms from Marc Brysbaert! These word frequency norms reflect frequency counts derived from movie and news subtitles. If you’re using CELEX or Kucera and Francis, drop those zeroes and get with the hero.
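Raw frequency counts are hard to compare across corpora of different sizes, so norms like these are usually scaled. Here is a minimal Python sketch of two common transformations: occurrences per million words and a log10 of the raw count plus one. The counts, the corpus size, and the function names are invented for illustration and are not taken from the SUBTLEX files themselves.

```python
import math


def per_million(count, corpus_size):
    """Scale a raw count to occurrences per million words of corpus."""
    return count * 1_000_000 / corpus_size


def log10_count(count):
    """Log-transform a raw count; adding 1 keeps a zero count defined."""
    return math.log10(count + 1)


if __name__ == "__main__":
    # Invented counts from a hypothetical 50-million-word subtitle corpus.
    toy_corpus_size = 50_000_000
    toy_counts = {"dog": 9_000, "aardvark": 12, "flibbertigibbet": 0}
    for word, count in toy_counts.items():
        print(word, per_million(count, toy_corpus_size), round(log10_count(count), 3))
```

The log transform is what most people feed into their regression models, since raw counts are wildly skewed toward a handful of very common words.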
multilex: word frequency for multiword utterances in French and English
It is difficult enough just to interpret a frequency value for one word (e.g., dog). It could be a noun or a verb or have an alternate meaning altogether. Multilex moves beyond single words to produce frequency values for multiword utterances in English and in French. The authors made creative use of the Google n-gram database here.
WorldLex Blog, Twitter and Newspapers Word Frequencies for 66 languages
Linguists rejoice! Lexical frequency data across many natural languages scraped from Twitter, Blogs, and other such media as reported by Gimenes, M., & New, B. (2015).