Text Summarization:

How To Calculate Rouge Score

Alparslan Mesri
6 min read · Aug 20, 2023
Photo by Mitchel Boot on Unsplash

This article is written by Alparslan Mesri and Eren Kızılırmak.

What is the ROUGE score?

Recall-Oriented Understudy for Gisting Evaluation, often referred to as the ROUGE score, is a metric used to evaluate text summarization and translation models. There are several variations of the ROUGE score. In this article we will show how to calculate ROUGE-N and ROUGE-L, and briefly cover the other types.

Since generated text cannot be compared directly the way binary classification outputs can, we instead count the overlapping words between the predictions (candidates) and the reference summaries, weighting them accordingly. The reference summaries are written manually by language experts, and there can be more than one reference for each text.

ROUGE scores can be calculated using Hugging Face's evaluate library.

import evaluate

# Load the ROUGE metric
rouge = evaluate.load('rouge')

candidates = ["Summarization is cool", "I love Machine Learning", "Good night"]

references = [["Summarization is beneficial and cool", "Summarization saves time"],
              ["People are getting used to Machine Learning", "I think i love Machine Learning"],
              ["Good night everyone!", "Night!"]]

# Each candidate is scored against its own list of references
results = rouge.compute(predictions=candidates, references=references)
print(results)

The output of this example is:

{'rouge1': 0.7833333333333332, 'rouge2': 0.5833333333333334, 
'rougeL': 0.7833333333333332, 'rougeLsum': 0.7833333333333332}

Let's calculate these scores by hand and check that they are correct. (Note that rougeLsum is ROUGE-L applied at the summary level, with texts split into sentences on newlines; for single-sentence examples like ours it equals rougeL.)

ROUGE-N

N indicates the n-gram size, typically 1 or 2. ROUGE-1 counts overlapping unigrams (single words), while ROUGE-2 counts overlapping bigrams. For example, the n-grams of candidate 2 are:

##for unigrams it is 
(I),(love),(Machine),(Learning)
##for bigrams it is
(I love),(love Machine),(Machine Learning)
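
To make this concrete, here is a minimal sketch of n-gram extraction (the ngrams helper is our own illustration, not part of any library):

def ngrams(text, n):
    # Lowercase, split on whitespace, then slide a window of size n
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I love Machine Learning", 1))
# [('i',), ('love',), ('machine',), ('learning',)]
print(ngrams("I love Machine Learning", 2))
# [('i', 'love'), ('love', 'machine'), ('machine', 'learning')]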

To calculate ROUGE scores, we first need to understand recall and precision in this context:

Recall = (number of overlapping n-grams) / (total n-grams in the reference)
Precision = (number of overlapping n-grams) / (total n-grams in the candidate)

To finalize the calculation, we also need the F1 score, the harmonic mean of precision and recall:

F1 = 2 * Precision * Recall / (Precision + Recall)
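
As a quick illustration, here is the same arithmetic as a small helper function of our own:

def precision_recall_f1(overlap, candidate_ngrams, reference_ngrams):
    # overlap: number of overlapping n-grams
    # candidate_ngrams / reference_ngrams: total n-gram counts in each text
    precision = overlap / candidate_ngrams
    recall = overlap / reference_ngrams
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

print(precision_recall_f1(3, 3, 5))  # (1.0, 0.6, 0.75), as for candidate 1 below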

ROUGE-1

Consider the first candidate and the reference set :

Candidate 1 : Summarization is cool
Reference 1 : Summarization is beneficial and cool
Reference 2 : Summarization saves time

Reference 1 shares more overlapping words (unigrams) with the candidate than reference 2 does, so we use reference 1 and make no calculations based on reference 2 for this candidate.

Recall = 3/5 = 0.6
Precision = 3/3 = 1

Rouge_1= 2*Recall*Precision/(Recall+Precision)= 2*(0.6)*(1)/((0.6)+1) = 0.75

The ROUGE-1 score for candidate 1 is 0.75. We have to score the other candidates as well and take the mean of the per-candidate ROUGE scores.

candidate 2 : I love Machine Learning
best reference : I think i love Machine Learning

Recall = 4/6 ≈ 0.667
Precision = 4/4 = 1
Rouge_1 = 2*0.667*1/(1+0.667) = 0.8

candidate 3 : Good night
best reference : Good night everyone!

Recall = 2/3 ≈ 0.667
Precision = 2/2 = 1
Rouge_1 = 2*0.667*1/(1+0.667) = 0.8

The mean of the F1 scores gives us the overall ROUGE-1 score for the dataset.

Total Rouge1 Score = (0.75 + 0.8 + 0.8)/3 ≈ 0.783

Our hand calculation matches the code output of 0.7833 (up to rounding).
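
We can also verify this programmatically. Below is a simplified sketch of our own (whitespace tokenization, clipped counts via Counter, best F1 across references); it is not the exact implementation behind the evaluate library, but it reproduces the numbers for this example. It reuses the candidates and references lists from the first code block.

from collections import Counter

def ngrams(text, n):
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, refs, n):
    # Score the candidate against each reference and keep the best F1
    best = 0.0
    cand = Counter(ngrams(candidate, n))
    for ref in refs:
        refc = Counter(ngrams(ref, n))
        overlap = sum((cand & refc).values())  # clipped overlap counts
        if overlap == 0:
            continue
        precision = overlap / sum(cand.values())
        recall = overlap / sum(refc.values())
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

scores = [rouge_n(c, r, 1) for c, r in zip(candidates, references)]
print(sum(scores) / len(scores))  # ≈ 0.7833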

ROUGE-2

It's the same process as above, but this time we work with bigrams.

candidate 1 : (Summarization is),(is cool)
reference 1: (Summarization is),(is beneficial),(beneficial and),(and cool)
reference 2: (Summarization saves),(saves time)

In terms of bigrams, there is exactly one match in reference 1 and none in reference 2, so reference 1 is the best reference for candidate 1.

Recall = 1/4 = 0.25
Precision = 1/2 = 0.5
Rouge_2 = (2*0.5*0.25)/(0.5+0.25) = 0.33

Calculation of the remaining candidates:

##candidate 2
candidate 2 : (I love),(love Machine),(Machine Learning)
best reference = (I think),(think i),(i love),(love Machine),(Machine Learning)
Recall = 3/5 = 0.6

Precision = 3/3 = 1
Rouge_2 = (2*1*0.6)/(1+0.6) = 0.75

##candidate 3
candidate 3 : (Good night)
best reference : (Good night),(night everyone!)
Recall = 1/2 = 0.5
Precision = 1/1 = 1
Rouge_2 = (2*1*0.5)/(1+0.5) ≈ 0.67

Mean of the ROUGE-2-F1 scores :

Total Rouge 2 score : (0.33 + 0.75 + 0.67)/3 ≈ 0.583

This matches the ROUGE-2 value of 0.5833 from the code.
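
The rouge_n helper from the previous sketch reproduces this with n=2:

scores = [rouge_n(c, r, 2) for c, r in zip(candidates, references)]
print(sum(scores) / len(scores))  # ≈ 0.5833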

ROUGE-L

ROUGE-L is based on the Longest Common Subsequence (LCS). The LCS is the longest sequence of words that appears in both the candidate and the reference summary while keeping the order of the words intact. It is important to note that an LCS is not necessarily consecutive, only in order. For example:

Model output: “A fast brown fox leaps over a sleeping dog.”

Reference summary: “The quick brown fox jumps over the lazy dog.”

To calculate ROUGE-L, we first find the longest common subsequence between the two sentences. Here it is "brown fox over dog."

Next we compute the precision, recall, and F1-score based on this longest common subsequence:

Precision: Number of words in LCS / Number of words in the model output = 4 / 9 ≈ 0.444 
Recall: Number of words in LCS / Number of words in the reference summary = 4 / 9 ≈ 0.444
F1-score = (2 * 0.444 * 0.444) / (0.444 + 0.444) ≈ 0.444

So, in this example, the ROUGE-L score would be approximately 0.444.
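
Here is a sketch of this computation, with the LCS length found by the standard dynamic-programming algorithm (again, illustrative code of our own):

def lcs_length(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming table
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[-1][-1]

candidate = "A fast brown fox leaps over a sleeping dog".lower().split()
reference = "The quick brown fox jumps over the lazy dog".lower().split()

lcs = lcs_length(candidate, reference)  # 4 ("brown fox over dog")
precision = lcs / len(candidate)        # 4/9
recall = lcs / len(reference)           # 4/9
print(2 * precision * recall / (precision + recall))  # ≈ 0.444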

The Balance of Precision and Recall

R_lcs = LCS(X, Y) / m
P_lcs = LCS(X, Y) / n
F_lcs = (1 + β²) * R_lcs * P_lcs / (R_lcs + β² * P_lcs)

In the formulas above, X is a reference summary sentence and Y is a candidate summary sentence, with lengths m and n respectively. β is a parameter that controls the relative importance of precision and recall. When β is 1, the ROUGE-L score is simply the harmonic mean of precision and recall; when β is greater than 1, recall is given more weight than precision.
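
In code, the weighted F-measure looks like this (a sketch of ours; the input values other than 0.444 are arbitrary examples):

def f_lcs(precision, recall, beta=1.0):
    # Weighted harmonic mean; beta > 1 shifts the weight toward recall
    if precision == 0 or recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(f_lcs(0.444, 0.444, beta=1))  # ≈ 0.444, the plain harmonic mean
print(f_lcs(0.3, 0.6, beta=8))      # ≈ 0.59, dominated by recall (0.6)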

ROUGE-W

We have mentioned that ROUGE-L does not differentiate between LCSes based on whether their matches are consecutive. Consider the example from the ROUGE paper, where X is the reference:

X : A B C D E F G
Y1 : A B C D H I K
Y2 : A H B K C I D

Y1 and Y2 have the same ROUGE-L score against X, since each shares an LCS of length 4 with it. However, Y1 should be a better match than Y2 because its matching words are consecutive. ROUGE-W improves on ROUGE-L by assigning more weight to consecutive matches. We have not included the ROUGE-W formulas in this article because they are quite involved; you can find them in the ROUGE paper: https://aclanthology.org/W04-1013/

ROUGE-S

ROUGE-S is based on skip-bigrams: pairs of words that keep their sentence order but may have arbitrary gaps between them. For example, the sentence below yields C(4,2) = 6 skip-bigrams:

Sentence: police killed the gunman

(“police killed”, “police the”,
“police gunman”, “killed the”,
“killed gunman”, “the gunman”)
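
Since skip-bigrams are just ordered word pairs, they can be enumerated with itertools.combinations (our own sketch):

from itertools import combinations

def skip_bigrams(text):
    # All ordered pairs of words, keeping sentence order, any gap allowed
    tokens = text.lower().split()
    return list(combinations(tokens, 2))

print(skip_bigrams("police killed the gunman"))
# [('police', 'killed'), ('police', 'the'), ('police', 'gunman'),
#  ('killed', 'the'), ('killed', 'gunman'), ('the', 'gunman')]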

The formula of ROUGE-S is almost the same as that of ROUGE-L, with the addition of the combination function C:

R_skip2 = SKIP2(X, Y) / C(m, 2)
P_skip2 = SKIP2(X, Y) / C(n, 2)
F_skip2 = (1 + β²) * R_skip2 * P_skip2 / (R_skip2 + β² * P_skip2)

where SKIP2(X, Y) is the number of skip-bigram matches between X and Y.

Advantages:

  • ROUGE is a well-established metric that has been shown to correlate well with human evaluation.
  • It is relatively easy to calculate and understand.
  • It is language independent, so it can be used to evaluate summaries in any language.

Disadvantages:

  • ROUGE only measures n-gram overlap, so it does not take into account the semantic meaning of the summary.
  • It is sensitive to the choice of reference summaries.
  • It can be biased towards summaries that are shorter or longer than the reference summaries.

In this article, we presented a brief overview of ROUGE and its different variants. We also showed how to calculate ROUGE scores using the Hugging Face evaluate library. Finally, we discussed the advantages and disadvantages of ROUGE. We hope this article has been helpful in understanding ROUGE and its use in text summarization evaluation.

References

1- https://aclanthology.org/W04-1013/

2- https://aclanthology.org/E17-2007/

3- https://huggingface.co/spaces/evaluate-metric/rouge

4- https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460
