quanteda: Quantitative Analysis of Textual Data

Kenneth Benoit and Paul Nulty

2014-10-10

This vignette serves as a brief introduction to the core features of quanteda. For a comprehensive guide to quantitative text analysis in R, see the (work-in-progress) book here

Introduction

The Rationale for quanteda

quanteda is an R package designed to simplify the process of quantitative analysis of text from start to finish, making it possible to turn texts into a structured corpus, convert this corpus into a quantitative matrix of features extracted from the texts, and perform a variety of quantitative analyses on this matrix.

The object is inference about the data contained in the texts, whether this means describing characteristics of the texts, inferring quantities of interest about the texts or their authors, or determining the tone or topics contained in the texts. The emphasis of quanteda is on simplicity: creating a corpus to manage texts and variables attached to these texts in a straightforward way, and providing powerful tools to extract features from this corpus that can be analyzed using quantitative techniques.

The tools for getting texts into a corpus object include:

loading texts from directories of individual files

loading texts “manually” by inserting them into a corpus using helper functions

managing text encodings and conversions from source files into corpus texts

attaching variables to each text that can be used for grouping, reorganizing a corpus, or simply recording additional information to supplement quantitative analyses with non-textual data

recording meta-data about the sources and creation details for the corpus.
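As a minimal sketch of the corpus-building steps listed above, the following builds a corpus "manually" from a character vector, attaches a document variable, and records corpus-level metadata. It assumes a recent quanteda release (3.x or later); the function names `corpus()`, `docvars()`, and `meta()` come from that release and may differ from the version this vignette was written for. The texts, the `party` variable, and the `source` field are invented for illustration.

```r
library(quanteda)

# Two short illustrative texts (invented for this example);
# the vector names become the document names
txts <- c(doc_a = "The quick brown fox jumps over the lazy dog.",
          doc_b = "A corpus stores texts together with document variables.")

# Build a corpus and attach one document-level variable per text
corp <- corpus(txts, docvars = data.frame(party = c("X", "Y")))

# Record corpus-level metadata about the source (field name is hypothetical)
meta(corp, "source") <- "Hand-typed example texts"

ndoc(corp)      # number of documents
docvars(corp)   # document-level variables
```

Loading texts from directories of files and handling their encodings is not shown here, since it depends on the files available locally.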

The tools for working with a corpus include:

summarizing the corpus in terms of its language units

reshaping the corpus into smaller units or more aggregated units

adding to or extracting subsets of a corpus

resampling texts of the corpus, for example for use in non-parametric bootstrapping of the texts

easy extraction and saving of keywords in context (KWIC) as a new data frame or corpus
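The corpus tools listed above can be sketched roughly as follows. This is an illustration under assumptions, not the vignette's own code: it uses function names from a recent quanteda release (3.x or later), namely `corpus_reshape()`, `corpus_subset()`, `corpus_sample()`, and `kwic()`, and the texts and the `side` variable are invented.

```r
library(quanteda)

# A tiny invented corpus with one document variable
corp <- corpus(c(d1 = "Taxes are high. Taxes should fall.",
                 d2 = "Spending on schools should rise."),
               docvars = data.frame(side = c("right", "left")))

# Summarize the corpus in terms of types, tokens, and sentences per text
summary(corp)

# Reshape into smaller units: one document per sentence
sents <- corpus_reshape(corp, to = "sentences")

# Extract a subset using a condition on a document variable
left <- corpus_subset(corp, side == "left")

# Resample documents with replacement, as in a non-parametric bootstrap
set.seed(1)
boot <- corpus_sample(corp, size = 2, replace = TRUE)

# Keywords in context: each match with a window of surrounding tokens
kw <- kwic(tokens(corp), pattern = "taxes", window = 2)
kw_df <- as.data.frame(kw)   # save the result as an ordinary data frame
```

Pattern matching in `kwic()` is case-insensitive by default, so both occurrences of "Taxes" in the first text are returned.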

For extracting features from a corpus, quanteda provides the following tools:

extraction of word types

extraction of word n-grams

extraction of dictionary entries from user-defined dictionaries
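The three kinds of feature extraction listed above might look like this in practice. This is a sketch assuming a recent quanteda release (3.x or later), where `tokens()`, `dfm()`, `tokens_ngrams()`, `dictionary()`, and `tokens_lookup()` are available; the sentence and the dictionary categories `economy` and `level` are invented for illustration.

```r
library(quanteda)

# Tokenize one invented sentence, dropping punctuation
toks <- tokens("Taxes are high and public spending is low.", remove_punct = TRUE)

# Word types: a document-feature matrix of (lowercased) word counts
dfmat <- dfm(toks)

# Word n-grams: here, adjacent pairs of tokens joined with "_"
bigrams <- tokens_ngrams(toks, n = 2)

# Dictionary entries: map tokens to user-defined categories
# ("tax*" is a glob pattern, so it also matches "Taxes")
dict <- dictionary(list(economy = c("tax*", "spending"),
                        level   = c("high", "low")))
dict_dfm <- dfm(tokens_lookup(toks, dictionary = dict))
dict_dfm
```

Working from a tokens object keeps each step explicit: the same tokens feed the word-count matrix, the n-grams, and the dictionary lookup.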
