Jamie’s Text Mining with R
Tutorials
These tutorials reflect course content from Introduction to Text Mining (CSCD 5000, Fall, 2020) at Temple University.
Accessing documents (janeaustenr, gutenbergr, harrypotter)
Finding corpora and importing them into R is not always easy. Some of these packages provide access to huge amounts of text, but there’s always a catch!
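As a minimal sketch of what access looks like: janeaustenr ships its texts with the package, while gutenbergr downloads at run time (the catch: you need the numeric Project Gutenberg ID and an internet connection). The ID below is just an example.

```r
library(janeaustenr)
library(gutenbergr)

# janeaustenr bundles the novels as a data frame: one row per line of text
austen <- austen_books()
head(austen)

# gutenbergr fetches a book from Project Gutenberg by its numeric ID
pride <- gutenberg_download(1342)  # example ID; look IDs up with gutenberg_metadata
```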
Characters and basic string operations: Part I
An introduction to working with characters and strings
Cleaning, stripping, and prepping text: our custom cleaning genie
Half the challenge with R is getting the data read in and cleaned up into a form where you can actually do the analyses you hope to do. This document reflects our lab’s efforts at developing a text stripping and cleaning function. We call it the cleaning genie. I hope to one day implement the genie as a Shiny app but for now here it is in all its raw regex glory.
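A stripped-down sketch of the idea, using base R only; the real genie handles far more regex cases than this.

```r
# Minimal text-cleaning function: lowercase, strip tags, punctuation,
# digits, and extra whitespace
clean_text <- function(x) {
  x <- tolower(x)                             # case-fold
  x <- gsub("<[^>]+>", " ", x)                # strip stray HTML tags
  x <- gsub("[[:punct:][:digit:]]+", " ", x)  # drop punctuation and digits
  x <- gsub("\\s+", " ", x)                   # collapse runs of whitespace
  trimws(x)
}

clean_text("The <b>Raven</b>, 1845!!")  # "the raven"
```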
Code snippets specialized for text-based analysis
Working with text (e.g., characters, strings) in R has its own syntax relative to more “generic” code for statistical modeling. This document collects code snippets that I have found helpful for dealing with thorny, recurring issues in text mining.
Document classification
Building simple classifiers (e.g., naive Bayes) using quanteda that “learn” a particular distinction during supervised training, then applying the model’s predictions to new documents.
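A sketch of the workflow, assuming the quanteda.textmodels package is installed and that `train_texts`, `train_labels`, and `test_texts` are hypothetical objects holding your own labeled data.

```r
library(quanteda)
library(quanteda.textmodels)

# Train a naive Bayes classifier on a labeled document-feature matrix
train_dfm <- dfm(tokens(train_texts, remove_punct = TRUE))
nb <- textmodel_nb(train_dfm, y = train_labels)

# New documents must share the training features before prediction
test_dfm <- dfm_match(dfm(tokens(test_texts)), features = featnames(train_dfm))
predict(nb, newdata = test_dfm)
```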
Lexical diversity metrics applied to the Harry Potter novels
Type-token ratios (TTRs) come in many flavors and are the bread and butter of many narrative analyses. The quanteda package offers numerous off-the-shelf TTR variants. This document applies them to a personal curiosity: did J.K. Rowling’s vocabulary repertoire grow as she crafted the series? We use moving averages to examine lexical diversity both within and between novels, including the Sorcerer’s Stone, the Prisoner of Azkaban, and the Deathly Hallows.
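As a sketch of the metrics themselves, assuming `hp_text` is a hypothetical character vector holding one novel’s text (the harrypotter package supplies the books).

```r
library(quanteda)
library(quanteda.textstats)

toks <- tokens(hp_text, remove_punct = TRUE)

# A couple of quanteda's off-the-shelf TTR variants
textstat_lexdiv(toks, measure = c("TTR", "CTTR"))

# MATTR: a moving-average TTR computed over a sliding window of tokens
textstat_lexdiv(toks, measure = "MATTR", MATTR_window = 500)
```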
Ngrams
An introduction to ngrams applied to Edgar Allan Poe’s The Raven
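A brief sketch of bigram extraction with quanteda, assuming `raven` is a hypothetical character vector holding the poem’s text.

```r
library(quanteda)

toks <- tokens(raven, remove_punct = TRUE)
bigrams <- tokens_ngrams(toks, n = 2)   # e.g., "once_upon", "upon_a", ...
topfeatures(dfm(bigrams), 10)           # ten most frequent bigrams
```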
quanteda and working with multi-document corpora
How to read multiple text files, convert them to corpus objects and document-feature matrices (DFMs), compute cosine distances between documents and features, and fit simple topic models.
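A sketch of that pipeline, assuming the readtext helper package and a hypothetical folder of plain-text files.

```r
library(quanteda)
library(quanteda.textstats)
library(readtext)

docs <- readtext("corpus_folder/*.txt")   # hypothetical folder of .txt files
corp <- corpus(docs)
dfmat <- dfm(tokens(corp, remove_punct = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))

# Cosine similarity between every pair of documents
textstat_simil(dfmat, method = "cosine", margin = "documents")
```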
Regular expressions
Grep your way through an introduction to regular expressions.
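A taste of the base-R pattern-matching functions the tutorial covers, no packages required.

```r
x <- c("raven", "Raven!", "never more", "nevermore")

grepl("^[Rr]aven", x)                    # TRUE TRUE FALSE FALSE
grep("never\\s?more", x, value = TRUE)   # "never more" "nevermore"
sub("more$", "MORE", x)                  # replaces a trailing "more"
```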
Topic models
Using the sotu package, we analyze topics within the State of the Union addresses of a single president and then of many presidents at once.
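A sketch of the single-president case, assuming the topicmodels package for LDA; the president chosen and k = 5 are arbitrary examples.

```r
library(quanteda)
library(sotu)
library(topicmodels)

# sotu ships the address texts and matching metadata
idx <- sotu_meta$president == "Barack Obama"
corp <- corpus(sotu_text[idx])
dfmat <- dfm(tokens(corp, remove_punct = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))

# Fit a 5-topic LDA model and inspect the top terms per topic
lda <- LDA(convert(dfmat, to = "topicmodels"), k = 5)
terms(lda, 10)
```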
Word Cloud
Generate a simple word cloud based on lexical frequency within a document.
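A sketch using quanteda’s plotting companion package, assuming `my_text` is a hypothetical character vector holding your document.

```r
library(quanteda)
library(quanteda.textplots)

dfmat <- dfm(tokens(my_text, remove_punct = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))

# Word size scales with frequency; rare words are dropped
textplot_wordcloud(dfmat, min_count = 5)
```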
Plot arousal ratings for every word of Goodfellas
Read in the script, remove stopwords, then split and unlist the script into a long vector of single content words. Yoke arousal values from the Warriner et al. database to each word of Goodfellas in the order it appeared in the movie. Then plot the arousal values, coloring each bar by whether the word is a curse or not. FUN STUFF.
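A sketch of that pipeline; the file names below are hypothetical, and the Warriner et al. (2013) norms must be obtained separately (the arousal column name may differ in your copy).

```r
library(quanteda)
library(ggplot2)

# Read the script and reduce it to an ordered vector of content words
script <- readLines("goodfellas_script.txt")          # hypothetical path
toks <- tokens(tolower(paste(script, collapse = " ")), remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
words <- unlist(toks)

# Yoke each word to its arousal norm, preserving movie order
norms <- read.csv("warriner_norms.csv")               # hypothetical file
df <- data.frame(word = words, order = seq_along(words))
df$arousal <- norms$A.Mean.Sum[match(df$word, norms$Word)]

ggplot(na.omit(df), aes(order, arousal)) + geom_col()
```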
Auto-segmenting verbal fluency data using semantic distance
This script takes a time series of category fluency data and breaks it into clusters using semantic distance.
Sample text mining analyses
The following documents are samples of work we have done in the lab. Note: this work was not peer reviewed. There are likely many errors, and we make no claims regarding the validity of the findings. These documents represent stops and starts and lessons learned. We are grateful for feedback or suggestions for improving anything you encounter here.
Unabomber Manifesto
This document represents a simple sentiment analysis, word cloud, and frequency output using a bag-of-words approach on Theodore Kaczynski’s essay, Industrial Society and Its Future.
Are these lyrics from Katy Perry or Metallica?