Machine Learning Tutorial

What It Does

Classifies a set of documents as relevant or non-relevant to a topic. Assesses the probability of a document being relevant to the topic of interest.

When to Use It

Best used for document classification and prioritization when moderate-to-high training resources are available. This function helps:

  1. To decide which documents to keep and which to discard, and
  2. To sort documents in order of relevance to a topic

You must have a training dataset (a set of documents annotated as relevant or not relevant to the topic of interest) to train the machine learning algorithm. At least 100 training documents are recommended, including a minimum of 25 relevant documents; more training data will produce better results. If the training data are randomly chosen from the larger pool of unclassified documents, predictions will be unbiased.
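Before uploading a training file, it can help to confirm that the annotations meet the recommended minimums. The following is a minimal sketch in Python, not part of DoCTER itself; the function name check_training_set and its parameters are hypothetical and are used here only for illustration.

```python
def check_training_set(labels, min_total=100, min_relevant=25):
    """Summarize whether 0/1 annotations meet the recommended minimums.

    `labels` is any iterable of annotations (0 = non-relevant, 1 = relevant).
    The default minimums mirror the recommendation in the text above.
    """
    labels = [int(x) for x in labels]
    if any(x not in (0, 1) for x in labels):
        raise ValueError("annotations must be 0 (non-relevant) or 1 (relevant)")
    total = len(labels)
    relevant = sum(labels)
    return {
        "total": total,
        "relevant": relevant,
        "enough_total": total >= min_total,
        "enough_relevant": relevant >= min_relevant,
    }


# Example: 100 annotated documents, 25 of them relevant.
summary = check_training_set([1] * 25 + [0] * 75)
print(summary)
```

If either flag is False, annotating more documents before training is likely worthwhile.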

How to Use It

  1. Direct DoCTER to the training file.
  2. Specify the column in which the text of the documents in the training file occurs.
  3. Specify the column containing the annotation (0/1; 0 for non-relevant, 1 for relevant) of the training documents.
  4. Direct DoCTER to the unclassified (non-annotated) data.
  5. Specify the column containing the text in the unclassified data file.
  6. Specify the desired recall (sensitivity) threshold (0.01-0.99).
  7. Specify any nuisance words (stopwords) to be removed from consideration.
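To give a sense of what a recall threshold means in practice, the sketch below shows one common way to turn a recall target into a probability cutoff using annotated documents. This is an illustration under assumptions, not a description of how DoCTER computes its cutoff internally; the function names cutoff_for_recall and classify are hypothetical.

```python
import math


def cutoff_for_recall(probs, labels, target_recall):
    """Return the highest probability cutoff whose recall meets the target.

    `probs` are predicted relevance probabilities for annotated documents,
    `labels` are their 0/1 annotations, and `target_recall` is the desired
    fraction of relevant documents to capture (0.01-0.99).
    """
    if not 0.01 <= target_recall <= 0.99:
        raise ValueError("target_recall must be between 0.01 and 0.99")
    relevant_scores = sorted(
        (p for p, y in zip(probs, labels) if y == 1), reverse=True
    )
    if not relevant_scores:
        raise ValueError("at least one relevant document is required")
    # Number of relevant documents that must score at or above the cutoff.
    needed = math.ceil(target_recall * len(relevant_scores))
    return relevant_scores[needed - 1]


def classify(probs, cutoff):
    """Label a document 1 (relevant) when its probability reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]


# Example: four relevant documents; a 0.75 target requires capturing three.
example_probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
example_labels = [1, 1, 0, 1, 0, 1]
cutoff = cutoff_for_recall(example_probs, example_labels, 0.75)
print(cutoff, classify(example_probs, cutoff))
```

A higher recall target lowers the cutoff, so more documents are kept as relevant at the cost of more non-relevant documents passing through.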

How to Interpret the Output

Your browser will display a model performance table that presents various statistical performance metrics. In addition, the output CSV available for download will contain all the original columns from the input CSV, plus two additional output columns:

  1. A "_Pred" column containing a 0/1 (non-relevant/relevant) classification for each document, and
  2. A "_Prob" column containing the probability score of relevance for each document.
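The two output columns support both uses described earlier: filtering on the classification and sorting by the probability. The sketch below is a minimal illustration in Python using only the standard library. It assumes the output column names end in "_Pred" and "_Prob"; the sample column names ML_Pred and ML_Prob, the sample data, and the function name rank_documents are hypothetical.

```python
import csv
import io


def rank_documents(csv_text):
    """Keep documents classified as relevant, sorted by descending probability."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = reader.fieldnames or []
    # Assumption: the output columns can be identified by their suffixes.
    pred_col = next(c for c in fields if c.endswith("_Pred"))
    prob_col = next(c for c in fields if c.endswith("_Prob"))
    kept = [r for r in rows if r[pred_col] == "1"]
    kept.sort(key=lambda r: float(r[prob_col]), reverse=True)
    return kept


# Hypothetical output file contents, for illustration only.
sample = (
    "id,text,ML_Pred,ML_Prob\n"
    "1,first abstract,1,0.62\n"
    "2,second abstract,0,0.10\n"
    "3,third abstract,1,0.91\n"
)
for row in rank_documents(sample):
    print(row["id"], row["ML_Prob"])
```

To process a downloaded file instead of the sample string, read its contents with open(path, newline="", encoding="utf-8").read() and pass the result to the function.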