Machine Learning Tutorial

What It Does

Classifies a set of documents as relevant or non-relevant to a topic. Assesses the probability of a document being relevant to the topic of interest.

When to Use It

Best used for document classification and prioritization when moderate-to-high training resources are available. This function helps:

To decide which documents to keep and which to discard, and
To sort documents in order of relevance to a topic.

You must have a training dataset (a set of documents annotated as being relevant or not to the topic of interest) to train the machine learning algorithm. At least 100 training documents are recommended, with a minimum of 25 relevant documents; more training data will produce better results. If the training documents are a random sample of the larger pool of unclassified documents, predictions will be unbiased.
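The size recommendations above can be checked before uploading a training file. The following is a minimal sketch, assuming the annotations have already been read into a list of 0/1 integers; the function name and its parameters are illustrative and are not part of DoCTER.

```python
def check_training_labels(labels, min_total=100, min_relevant=25):
    """Return a list of warning strings; an empty list means both recommendations are met.

    The name and defaults are illustrative; they mirror the recommended
    100 training documents and 25 relevant documents.
    """
    # Annotations are expected to be integers 0 (non-relevant) or 1 (relevant).
    invalid = [value for value in labels if value not in (0, 1)]
    if invalid:
        raise ValueError(f"annotations must be 0 or 1, found: {invalid[:5]}")

    total = len(labels)
    relevant = sum(labels)  # each relevant document is annotated 1
    warnings = []
    if total < min_total:
        warnings.append(f"only {total} training documents; at least {min_total} recommended")
    if relevant < min_relevant:
        warnings.append(f"only {relevant} relevant documents; at least {min_relevant} recommended")
    return warnings


print(check_training_labels([1] * 25 + [0] * 75))  # meets both recommendations
print(check_training_labels([1] * 10 + [0] * 50))  # falls short on both
```

A training set that passes this check satisfies only the stated minimums; more annotated documents are still expected to improve results.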

How to Use It

Direct DoCTER to the training file.
Specify the column in which the text of the documents in the training file occurs.
Specify the column containing the annotation (0/1; 0 for non-relevant, 1 for relevant) of the training documents.
Direct DoCTER to the unclassified (non-annotated) data.
Specify the column containing the text in the unclassified data file.
Specify the desired recall (sensitivity) threshold (0.01-0.99).
Specify any nuisance words (stopwords) to be removed from consideration.
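To illustrate what a recall (sensitivity) target means in practice, the sketch below picks the probability cutoff at which the requested share of known relevant documents would still be classified as relevant. This is one common approach and an assumption for illustration only; how DoCTER applies the threshold internally is not described here, and the function name is hypothetical.

```python
import math


def cutoff_for_recall(probs, labels, target_recall):
    """Return the highest cutoff such that at least target_recall of the
    relevant (label 1) documents have a probability >= cutoff."""
    if not 0.01 <= target_recall <= 0.99:
        raise ValueError("target_recall must be between 0.01 and 0.99")

    # Scores of the documents annotated as relevant, highest first.
    relevant_scores = sorted(
        (p for p, y in zip(probs, labels) if y == 1), reverse=True
    )
    if not relevant_scores:
        raise ValueError("no relevant documents to measure recall against")

    # Number of relevant documents that must score at or above the cutoff.
    needed = math.ceil(target_recall * len(relevant_scores))
    return relevant_scores[needed - 1]


# Made-up scores and annotations for six documents.
probs = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0]

cutoff = cutoff_for_recall(probs, labels, 0.75)
predictions = [1 if p >= cutoff else 0 for p in probs]
print(cutoff, predictions)
```

A higher recall target lowers the cutoff, so more documents are kept as relevant at the cost of more non-relevant documents slipping through.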

How to Interpret the Output

Your browser will display a model performance table that presents various statistical performance metrics. In addition, the output CSV available for download will contain all the original columns in the input CSV, plus two additional output columns:

A "_Pred" column containing a 0/1 (non-relevant/relevant) classification for each document, and
A "_Prob" column containing the probability score of relevance for each document.
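The two output columns support both uses described earlier: keeping or discarding documents, and sorting them by relevance. The following is a minimal sketch using Python's standard csv module. The sample data and the column names "Text_Pred" and "Text_Prob" are made up for illustration; the code only assumes that the output column names end in "_Pred" and "_Prob".

```python
import csv
import io


def rank_relevant(csv_text):
    """Keep rows classified as relevant and sort them by probability, highest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []

    # Locate the output columns by suffix (assumed naming, see lead-in).
    pred_col = next(name for name in rows[0] if name.endswith("_Pred"))
    prob_col = next(name for name in rows[0] if name.endswith("_Prob"))

    kept = [row for row in rows if row[pred_col] == "1"]
    return sorted(kept, key=lambda row: float(row[prob_col]), reverse=True)


# Hypothetical output file contents.
sample = (
    "ID,Text,Text_Pred,Text_Prob\n"
    "1,alpha,0,0.12\n"
    "2,beta,1,0.67\n"
    "3,gamma,1,0.91\n"
)

for row in rank_relevant(sample):
    print(row["ID"], row["Text_Prob"])
```

To review every document in priority order instead of only the predicted-relevant ones, drop the filtering step and sort all rows by the probability column.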

Input and Output File Formats

Overview of Input and Output file formats.

Topic Extraction (Clustering)

Clusters a set of documents into a user-specified number of bins. For each bin, identifies the defining topics/keywords.

Supervised Clustering

Classifies a set of documents as relevant or non-relevant to a topic. Assigns discrete (1-6) priority scores to each document.