Supervised Clustering Tutorial

What It Does

Classifies each document in a set as relevant or non-relevant to a topic, and assigns each document a discrete priority score (0 for documents predicted non-relevant, 1-6 in increasing order of relevance).

When to Use It

Best used for document classification and prioritization when relatively few training resources are available. This function helps to:

Requires a small set of seed documents (as few as 25-50) annotated as relevant to the topic of interest. If the seed documents are randomly chosen from the larger pool of unclassified documents, predictions will be unbiased.
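
Random selection of seed candidates can be done outside DoCTER before annotation. The following is a minimal sketch, not part of DoCTER itself: the function name, the document IDs, and the sample size of 200 are all hypothetical. How many documents must be screened to find 25-50 relevant ones depends on how common relevant documents are in the pool.

```python
import random

def draw_seed_candidates(doc_ids, n_candidates, seed=0):
    """Return a uniformly random sample of document IDs to screen by hand.

    Hypothetical helper for illustration; it is not a DoCTER function.
    """
    rng = random.Random(seed)  # fixed seed so the same draw can be reproduced
    n = min(n_candidates, len(doc_ids))  # never ask for more than the pool holds
    return rng.sample(list(doc_ids), n)  # sampling without replacement

# Hypothetical pool of 1000 unclassified document IDs.
pool = [f"doc{i:04d}" for i in range(1000)]
candidates = draw_seed_candidates(pool, 200)
```

The sampled documents are then annotated by hand, and those judged relevant become the seed set.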

How to Use It

  1. Direct DoCTER to the input file.
  2. Specify the column in which the text of the documents occurs.
  3. Specify the desired recall threshold.
  4. Specify any nuisance words (stopwords) to be removed from consideration.
  5. Select preferred mode of output (Ensemble Only is recommended).
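
Before uploading, it can help to confirm that the input file actually contains the text column named in step 2. The sketch below is one way to do that with Python's standard library; the helper name, the column name "Abstract", and the sample rows are assumptions made for illustration, not DoCTER requirements.

```python
import csv
import io

def check_text_column(csv_text, text_column):
    """Return (row_count, empty_count) for the named text column.

    Hypothetical pre-upload check; it is not part of DoCTER.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if text_column not in (reader.fieldnames or []):
        raise ValueError(
            f"column {text_column!r} not found; file has {reader.fieldnames}"
        )
    rows = empty = 0
    for row in reader:
        rows += 1
        # Count rows whose text cell is missing or only whitespace.
        if not (row[text_column] or "").strip():
            empty += 1
    return rows, empty

# Hypothetical three-row input; the second row has no text.
sample = (
    "ID,Abstract\n"
    "1,Effects of compound X on liver\n"
    "2,\n"
    "3,Unrelated survey of birds\n"
)
row_count, empty_count = check_text_column(sample, "Abstract")
```

For a file on disk, the same function can be given the file's contents read as a string.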

How to Interpret the Output

Your browser will display a model performance table that presents various performance statistics; the most important is typically the predicted recall of the ensemble. In addition, the output CSV available for download will contain all the original columns in the input CSV, plus two additional output columns in ensemble mode:

  1. An "Ensemble Score" column containing a score from 0-6 in increasing order of relevance, and
  2. An "Ensemble_AnyOnePositive" column containing a binary (0/1; non-relevant/relevant) prediction of relevance for each document.

Documents with a 0 score may be discarded with the expectation that the remaining documents will achieve the desired recall threshold.
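
The discard-and-prioritize step described above can be sketched as follows. This is an illustration only: the sample rows and the "ID" column are hypothetical, and the score column header is taken from the description above, so it should be checked against the header in the actual downloaded file.

```python
import csv
import io

SCORE_COL = "Ensemble Score"  # header as described above; verify in your file

def prioritize(csv_text):
    """Drop documents scored 0 and order the rest from highest to lowest score."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # int(float(...)) tolerates scores written as "3" or "3.0".
    kept = [r for r in rows if int(float(r[SCORE_COL])) > 0]
    kept.sort(key=lambda r: int(float(r[SCORE_COL])), reverse=True)
    return kept

# Hypothetical output file with four documents.
sample = (
    "ID,Ensemble Score,Ensemble_AnyOnePositive\n"
    "a,0,0\n"
    "b,4,1\n"
    "c,6,1\n"
    "d,1,1\n"
)
kept = prioritize(sample)
review_order = [r["ID"] for r in kept]
```

Reviewing the retained documents in this order puts the highest-priority documents first.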