quanteda: Quantitative Analysis of Textual Data
Kenneth Benoit and Paul Nulty
2014-10-10
This vignette serves as a brief introduction to the core features of quanteda. For a comprehensive guide to quantitative text analysis in R, see the (work-in-progress) book here.
Introduction
The Rationale for quanteda
quanteda is an R package designed to simplify the process of quantitative analysis of text from start to finish, making it possible to turn texts into a structured corpus, convert this corpus into a quantitative matrix of features extracted from the texts, and perform a variety of quantitative analyses on this matrix.
The object is inference about the data contained in the texts, whether this means describing characteristics of the texts, inferring quantities of interest about the texts or their authors, or determining the tone or topics contained in the texts. The emphasis of quanteda is on simplicity: creating a corpus to manage texts and variables attached to these texts in a straightforward way, and providing powerful tools to extract features from this corpus that can be analyzed using quantitative techniques.
The tools for getting texts into a corpus object include:
loading texts from directories of individual files
loading texts “manually” by inserting them into a corpus using helper functions
managing text encodings and conversions from source files into corpus texts
attaching variables to each text that can be used for grouping, reorganizing a corpus, or simply recording additional information to supplement quantitative analyses with non-textual data
recording meta-data about the sources and creation details for the corpus.
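As a rough illustration of the list above, the following sketch builds a small corpus "manually" from a character vector, attaches document-level variables, and records corpus metadata. It assumes a recent quanteda release (version 3 or later); function and argument names in the version this vignette was written for may differ. The texts, variable values, and metadata are invented for the example. Loading from directories of files and converting encodings are not shown.

```r
library(quanteda)

# Two invented texts, inserted "manually" as a named character vector
txts <- c(doc_a = "Taxes must fall. Spending must fall too.",
          doc_b = "We will invest in schools and hospitals.")

# Document-level variables, one row per text (hypothetical values)
vars <- data.frame(party = c("Right", "Left"), year = c(2010, 2010))

# Build the corpus, attaching the variables and some corpus-level metadata
corp <- corpus(txts, docvars = vars,
               meta = list(source = "Invented example texts"))

ndoc(corp)              # number of documents in the corpus
docvars(corp, "party")  # the variable attached to each text
meta(corp)              # corpus-level metadata recorded above
```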
The tools for working with a corpus include:
summarizing the corpus in terms of its language units
reshaping the corpus into smaller units or more aggregated units
adding to or extracting subsets of a corpus
resampling texts of the corpus, for example for use in non-parametric bootstrapping of the texts
extracting and saving keywords in context (KWIC) as a new data frame or corpus
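The corpus tools listed above might look as follows in practice. This is a minimal sketch, again assuming quanteda version 3 or later rather than the release this vignette describes; the two texts and the `party` variable are invented for the example.

```r
library(quanteda)

corp <- corpus(c(doc_a = "Taxes must fall. Spending must fall too.",
                 doc_b = "We will invest in schools and hospitals."),
               docvars = data.frame(party = c("Right", "Left")))

# Summarize the corpus in terms of its language units
summary(corp)

# Reshape into smaller units: one document per sentence
sents <- corpus_reshape(corp, to = "sentences")

# Extract a subset using a condition on a document variable
right <- corpus_subset(corp, party == "Right")

# Resample with replacement, as one would in a non-parametric bootstrap
set.seed(42)
boot <- corpus_sample(sents, size = ndoc(sents), replace = TRUE)

# Keywords in context: every match of "fall" with two tokens on either side
kw <- kwic(tokens(corp), pattern = "fall", window = 2)
kw_df <- as.data.frame(kw)  # save the matches as an ordinary data frame
```

Resampling whole sentences rather than single words keeps local word order intact within each resampled unit, which is one reason to reshape before sampling.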
For extracting features from a corpus, quanteda provides the following tools:
extraction of word types
extraction of word n-grams
extraction of dictionary entries from user-defined dictionaries
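The three kinds of feature extraction named above can be sketched as follows, assuming quanteda version 3 or later (the function names in the version this vignette describes may differ). The sentence and the dictionary keys and entries are invented for the example.

```r
library(quanteda)

# Tokenize one invented text, dropping punctuation
toks <- tokens("Taxes must fall. Spending must fall too.", remove_punct = TRUE)

# Word types: the unique tokens that occur in the text
word_types <- types(toks)

# Word n-grams: adjacent pairs of tokens, joined with "_" by default
bigrams <- tokens_ngrams(toks, n = 2)

# A user-defined dictionary; entries are glob patterns matched case-insensitively
dict <- dictionary(list(economy    = c("tax*", "spending"),
                        obligation = "must"))

# Replace matching tokens with their dictionary key, then count keys per document
counts <- dfm(tokens_lookup(toks, dictionary = dict))
counts
```

Here the dictionary is applied to the tokens before the matrix is built, so the resulting matrix has one column per dictionary key rather than one per word.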