Topic Extraction Tutorial

What It Does

Clusters a set of documents into a user-specified number of bins. For each bin, identifies the defining topics/keywords.

When to Use It

Use it to explore the major topics in a diverse batch of documents. The algorithms require no training (no document annotation is needed), so topic extraction can serve as a rapid assessment tool.

How to Use It

Direct DoCTER to the input file.
Specify the column in which the text of the documents to be clustered occurs.
Specify the desired number of clusters (10 is typically a good choice).
Specify word grouping/phrase length (2 is typically a good choice).
Specify any nuisance words (stopwords) to be removed from consideration.
Specify choice of algorithm (each offers advantages in different contexts, but LDA requires a longer run time).
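
To give a feel for what the phrase length and stopword settings control, here is a minimal Python sketch. It is an illustration only, not DoCTER's actual implementation: the function name extract_phrases is hypothetical, and it assumes a phrase length of 2 means "single words plus two-word phrases", which the tutorial does not state.

```python
import re

def extract_phrases(text, max_len=2, stopwords=()):
    """Return phrases of 1..max_len consecutive words, after dropping stopwords.

    Hypothetical helper for illustration; DoCTER's own tokenization may differ.
    """
    stop = {w.lower() for w in stopwords}
    # Lowercase the text, split it into simple word tokens, and drop stopwords.
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in stop]
    phrases = []
    for n in range(1, max_len + 1):
        # Slide a window of n words across the filtered token list.
        for i in range(len(words) - n + 1):
            phrases.append(" ".join(words[i:i + n]))
    return phrases

phrases = extract_phrases(
    "The effects of lead exposure in children",
    max_len=2,
    stopwords=["the", "of", "in"],
)
print(phrases)
```

Because stopwords are removed before phrases are formed in this sketch, a two-word phrase can span a gap left by a removed word. Whether DoCTER behaves the same way is not stated in this tutorial.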

How to Interpret the Output

Your browser displays a topic table showing the number of documents in each cluster along with each cluster's defining keywords (its topic signature). The output CSV available for download contains all the original columns of the input CSV, plus a "Topic" column indicating the bin to which each document belongs. The topic table described above is also appended to the right of the Topic column.
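
As a rough check on a downloaded file, the documents per bin can be tallied from the "Topic" column with a few lines of Python. This is a sketch under stated assumptions: the sample rows, the "ID" and "Abstract" column names, and the numeric topic labels are all hypothetical. Only the "Topic" column name comes from the description above.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded output CSV. In practice, open the
# real file instead, e.g. open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    "ID,Abstract,Topic\n"
    "1,Lead exposure in children,0\n"
    "2,Arsenic in drinking water,1\n"
    "3,Blood lead levels,0\n"
)

def count_by_topic(handle):
    """Count rows per value of the "Topic" column, skipping rows without one."""
    reader = csv.DictReader(handle)
    return Counter(row["Topic"] for row in reader if row.get("Topic"))

counts = count_by_topic(sample)
for topic, n_docs in sorted(counts.items()):
    print(topic, n_docs)
```

The counts should agree with the document totals shown in the topic table. The sketch reads only the "Topic" column because the headers of the appended topic table are not specified here.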

Input and Output File Formats

Overview of Input and Output file formats.

Supervised Clustering

Classifies a set of documents as relevant or non-relevant to a topic. Assigns discrete (1-6) priority scores to each document.

Machine Learning

Classifies a set of documents as relevant or non-relevant to a topic. Assesses the probability of a document being relevant to the topic of interest.