3 Ways to Judge Success Besides Accuracy: Precision, Recall, and F Scores

Accuracy is both a key performance indicator and critical to company success, but people often mean different things by it, and it isn’t the only summary statistic that matters. For many applications, measures like precision, recall, and false discovery rate are better metrics. To explain some of the metrics we use to evaluate our software’s performance, I’ll use the example of a ten-question True/False pop quiz.

A high school social studies teacher is preparing a unit on evaluating news articles by their headlines and first paragraphs. They gather articles from news and culture magazines, tabloids, satire websites, and newspapers, then prepare daily 10-question quizzes for their class.

The following table will help us understand our different statistics. The columns of the table correspond to the correct answer on the quiz, while the rows indicate how students answered the questions on the quiz.

[Table: students’ answers (rows) vs. correct answers (columns)]

In many classrooms, accuracy is used to assign the grade on such a quiz. For each right answer you get one point, and your score is taken out of 10: get 8 answers right and you get an 80% on the quiz. To be mathematically explicit, accuracy is the number of answers correctly marked True plus the number of answers correctly marked False, divided by the total number of questions.
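To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The counts below are hypothetical, not taken from the table above.

```python
# Hypothetical results for one student on a 10-question quiz.
true_positives = 5   # marked True, correct answer True
true_negatives = 3   # marked False, correct answer False
false_positives = 1  # marked True, correct answer False
false_negatives = 1  # marked False, correct answer True

total = true_positives + true_negatives + false_positives + false_negatives
accuracy = (true_positives + true_negatives) / total
print(accuracy)  # 0.8 -> an 80% quiz grade
```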

The teacher has grown tired of the traditional scoring metrics. They instead decide that it is most important that the students don’t disbelieve true news stories. To incentivize this, the teacher will only grade questions for which the answer is True. “It’s more important that my students correctly identify the True statements as True. They should catch every True story, so I’ll grade them that way,” they think to themselves. The quiz grade will be the number of answers correctly marked True divided by the number of questions where the correct answer is True (true positives + false negatives). This is known as the true positive rate (TPR), sensitivity, or recall.
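In code, recall looks like the sketch below; the counts are again hypothetical.

```python
# Recall (true positive rate): of all the questions whose correct answer
# is True, what fraction did the student also mark True?
true_positives = 5   # marked True, correct answer True
false_negatives = 1  # marked False, correct answer True

recall = true_positives / (true_positives + false_negatives)
print(round(recall, 2))  # 0.83
```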

The students in our hypothetical class see a beautiful flaw in the teacher’s plan: if they answer every question as True, they will get 100% on the quiz because they have no false negatives! After grading quiz 1, our hypothetical instructor reconsiders.

“I was wrong,” the teacher concedes. “It is important that my students not believe everything they read. For quiz 2, I’ll grade them just on the questions they mark as True. Their skepticism is most important.” The quiz grade in this case is the number of answers correctly marked True divided by the number of questions the class answered as True (true positives + false positives). This measure is known as the Positive Predictive Value (PPV), or precision.
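Precision can be sketched the same way, again with hypothetical counts.

```python
# Precision (positive predictive value): of all the questions the student
# marked True, what fraction really were True?
true_positives = 5   # marked True, correct answer True
false_positives = 1  # marked True, correct answer False

precision = true_positives / (true_positives + false_positives)
print(round(precision, 2))  # 0.83
```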

The students get word of their teacher’s new plan and decide that they will pick only the one news article they are most sure is true and answer False for everything else. As long as they get that one question right, they will get 100% on the quiz (since they won’t have any false positives). After one too many students writes, “I know one thing: that I know nothing” on their quiz, the teacher decides there must be a better way.

After doing some research, the fearless teacher finally settles on a statistic to use for quiz 3: the F1 score. There are a few different ways to understand the F1 score. It always has a value between precision and recall, and it is closer to the lower of those two values. The F1 score still rewards correct positive answers more than correct negative answers, but because both precision and recall matter, extreme strategies won’t work as well. It isn’t necessary to understand the mathematical details behind the F1 score to grasp the concept, but for those who are interested, the formula is:

F1 = 2 × (precision × recall) / (precision + recall)
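As a quick sanity check of that formula, here is a hypothetical Python sketch; the precision and recall values are invented, not the students’ actual scores.

```python
# F1 is the harmonic mean of precision and recall.
precision = 1.0  # a very skeptical student: no false positives
recall = 0.5     # but they only caught half of the True stories

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.67 -- between the two, and closer to the lower value
```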

Our poor students aren’t sure what to do with this grading system, and they develop different strategies. Some students use the strategy from the first quiz, answering True to everything. Others use the strategy from the second quiz, trying to make sure they get very few false positives. The remaining students do their best, but have different levels of skepticism.

The trusting student believed all the true news articles and some of the false ones.

The balanced student was just as likely to believe the false articles as to disbelieve the true articles.

The skeptical student wasn’t fooled by any of the false articles but didn’t believe all the true ones either.

The following table gives an example of how these descriptive statistics can differ for students who behave differently. Notice that even though the trusting, balanced, and skeptical students all have identical accuracy, their recall, precision, and F1 scores are dramatically different.

[Table: accuracy, recall, precision, and F1 scores for each student strategy]

While this example fits a theoretical classroom better than a real one, these and similar statistical measures are used in medical testing, manufacturing quality control, comparing machine learning models, and information extraction and retrieval problems. The relative importance of recall (catching every true positive) and precision (minimizing false positives) varies dramatically depending on the business case. For our redaction projects, it is much more important to catch as much sensitive information as possible (high recall). For some of our indexing and document classification projects, it is sometimes better for us to report nothing or use a default “Unknown” answer than to return a false positive (high precision).

The F1 score balances recall and precision equally, but depending on the business case, we can use a more general version of this statistic, Fβ (F-beta). Beta tells us how much extra weight to give false negatives over false positives. F2 weights recall higher than precision (increasing the importance of false negatives), while F0.5 weights precision higher (decreasing the importance of false negatives). The following graph compares how the Fβ scores of the different quiz strategies change as the relative importance of precision and recall changes. By adjusting the beta parameter, our hypothetical teacher can tune the relative rewards of skepticism and of taking in new information, and we can fine-tune our models to fit our clients’ business needs.

[Graph: Fβ scores of the different quiz strategies as β changes]
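For readers who want the general formula, here is a small Python sketch of Fβ; the precision and recall values are invented for illustration.

```python
# General F-beta score: beta > 1 weights recall (false negatives) more,
# beta < 1 weights precision (false positives) more, and beta = 1 gives F1.
def f_beta(precision, recall, beta):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.9, 0.6
print(round(f_beta(precision, recall, 1.0), 2))  # 0.72 (F1)
print(round(f_beta(precision, recall, 2.0), 2))  # 0.64 (F2: recall matters more)
print(round(f_beta(precision, recall, 0.5), 2))  # 0.82 (F0.5: precision matters more)
```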

At Extract Systems, we work with clients to determine the relative importance of false negatives and false positives and how that impacts our solutions for their problems. As a data capture analyst, I optimize our software for each client’s specific business rules, balancing accuracy, false positives, and false negatives to minimize personnel time and mission critical errors. If you’d like to learn more about how we develop our algorithms, weight things like false positives, or see a demo of our software, please reach out today.


ABOUT THE AUTHOR: NATHAN NEFF-MALLON

Nathan is a Data Capture Analyst at Extract with experience in data analysis, software development, machine learning, teaching, and lasers. He earned his bachelor’s degree at Whitman College in Walla Walla, Washington, and his master’s in chemistry from the University of Wisconsin–Madison. Nathan enjoys statistics, building models, and glassblowing.