Supervised Clustering Tutorial

What It Does

Classifies a set of documents as relevant or non-relevant to a topic and assigns each document a discrete priority score (0-6).

When to Use It

Best suited to document classification and prioritization when relatively few training resources are available. This function helps to:

Decide which documents to keep and which to discard.
Sort documents in order of relevance to a topic.

Requires a small set of seed documents annotated as relevant to the topic of interest (as few as 25-50 relevant documents). If the seed documents are randomly chosen from the larger pool of unclassified documents, the predictions will be unbiased.
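One way to draw candidate seed documents at random is sketched below. This is only an illustration and not part of DoCTER: the document IDs, the function name sample_for_annotation, and the sample size of 200 are all hypothetical. The number of documents you need to draw depends on how common relevant documents are in your pool, since only those annotated as relevant count toward the 25-50 seeds.

```python
import random

def sample_for_annotation(doc_ids, n, seed=0):
    """Draw up to n document IDs uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    ids = list(doc_ids)
    return rng.sample(ids, min(n, len(ids)))

# Hypothetical pool of 1000 unclassified documents.
pool = [f"doc{i}" for i in range(1000)]

# Documents to read and annotate; those judged relevant become the seeds.
to_annotate = sample_for_annotation(pool, 200)
```

Because every document in the pool has the same chance of being drawn, the annotated subset reflects the pool as a whole rather than the results of a particular search or sort order.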

How to Use It

Direct DoCTER to the input file.
Specify the column in which the text of the documents occurs.
Specify the desired recall threshold.
Specify any nuisance words (stopwords) to be removed from consideration.
Select preferred mode of output (Ensemble Only is recommended).
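Before pointing DoCTER at an input file, it may help to confirm that the text column you plan to specify actually exists and is populated. The sketch below is an optional pre-check written for illustration, not a DoCTER feature; the column name "Abstract" and the helper check_text_column are assumptions.

```python
import csv
import io

def check_text_column(csv_file, text_column):
    """Return (row_count, empty_count) for text_column; raise if it is missing."""
    reader = csv.DictReader(csv_file)
    if text_column not in (reader.fieldnames or []):
        raise KeyError(f"column {text_column!r} not found in {reader.fieldnames}")
    rows = empty = 0
    for row in reader:
        rows += 1
        # Count rows whose text cell is missing or only whitespace.
        if not (row[text_column] or "").strip():
            empty += 1
    return rows, empty

# Hypothetical input; in practice, open your own CSV file instead.
sample = io.StringIO("ID,Abstract\n1,Effects of X on Y\n2,\n3,Another study\n")
rows, empty = check_text_column(sample, "Abstract")
print(f"{rows} rows, {empty} with empty text")
```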

How to Interpret the Output

Your browser will display a model performance table that presents various performance statistics; the most important is typically the predicted recall of the ensemble. In addition, the output CSV available for download will contain all the original columns of the input CSV, plus two additional output columns in ensemble mode:

An "Ensemble Score" column containing a score from 0-6 in increasing order of relevance, and
An "Ensemble_AnyOnePositive" column containing a binary (0/1; non-relevant/relevant) prediction of relevance for each document.

Documents with a score of 0 may be discarded with the expectation that the remaining documents will achieve the desired recall threshold.
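The discard-and-sort step can be sketched as follows, assuming the score column header is exactly "Ensemble Score" as described above (check the downloaded file, since the actual header may differ). The helper keep_and_rank and the sample rows are hypothetical.

```python
import csv
import io

SCORE_COLUMN = "Ensemble Score"  # assumed header; verify against your output file

def keep_and_rank(csv_file, score_column=SCORE_COLUMN):
    """Drop rows scored 0 and return the rest, highest score first."""
    rows = list(csv.DictReader(csv_file))
    kept = [r for r in rows if float(r[score_column]) > 0]
    kept.sort(key=lambda r: float(r[score_column]), reverse=True)
    return kept

# Hypothetical output file contents; in practice, open the downloaded CSV.
output = io.StringIO(
    "ID,Ensemble Score,Ensemble_AnyOnePositive\n"
    "a,0,0\n"
    "b,3,1\n"
    "c,6,1\n"
    "d,1,1\n"
)
ranked = keep_and_rank(output)
for row in ranked:
    print(row["ID"], row[SCORE_COLUMN])
```

Reading the ranked list from the top puts the documents most likely to be relevant first, which is useful when there is not enough time to review everything that was kept.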

Input and Output File Formats

Overview of input and output file formats.

Topic Extraction (Clustering)

Clusters a set of documents into a user-specified number of bins. For each bin, identifies the defining topics/keywords.

Machine Learning

Classifies a set of documents as relevant or non-relevant to a topic. Assesses the probability of a document being relevant to the topic of interest.