Recent advancements in large language models (LLMs) have led to an impressive capability to generate text that closely mimics human writing. To assess the similarity between human and machine-generated texts, researchers have developed various metrics, and improving these measures is a key focus within the field.
Evaluating Semantic Similarity
One method of evaluation compares a human-written reference text with the output of a language model. BERTScore is one such metric: it gauges semantic similarity by computing cosine similarities between contextual token embeddings, capturing parallels that surface-level word matching misses.
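As a rough sketch of how this looks in practice, the open-source bert-score Python package can be used to score a candidate against a reference. The sentences below are toy examples, and defaults such as the underlying encoder model vary by package version:

```python
from bert_score import score

references = ["the cat sat on the mat"]
candidates = ["a cat was sitting on the mat"]

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"Precision: {P.mean():.3f}  Recall: {R.mean():.3f}  F1: {F1.mean():.3f}")
```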
Understanding BERTScore
For instance, when comparing the reference sentence “the weather is cold today” with the machine-generated “it is freezing today,” traditional n-gram-based metrics such as BLEU rate the similarity low because the two sentences share few surface words, despite their obvious semantic congruence. BERTScore addresses this by comparing the contextual embeddings of each token in the two texts, so paraphrases that use different wording can still be recognized as similar.
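To make the embedding step concrete, here is a minimal sketch using the Hugging Face transformers library. The choice of roberta-base is an arbitrary stand-in for the encoder, not BERTScore’s tuned default, and real implementations additionally select a specific hidden layer and drop special tokens:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "roberta-base"  # illustrative choice of encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def token_embeddings(sentence: str) -> torch.Tensor:
    """Return L2-normalised contextual embeddings, one row per token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

ref = token_embeddings("the weather is cold today")
cand = token_embeddings("it is freezing today")

# Cosine similarity between every (reference token, candidate token) pair.
sim = ref @ cand.T  # shape: (ref_tokens, cand_tokens)
```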
From these pairwise similarities, BERTScore computes recall by averaging, for each reference token, its maximum cosine similarity to any candidate token; precision is computed symmetrically over the candidate tokens, and F1 is the harmonic mean of the two. This gives a more faithful picture of how closely the model’s output matches human-written text.
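The aggregation itself is a simple greedy matching. A minimal sketch, assuming sim is a (reference tokens × candidate tokens) cosine-similarity matrix like the one produced above (the values here are made up):

```python
import torch

# Toy similarity matrix: 2 reference tokens x 3 candidate tokens.
sim = torch.tensor([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
])

recall = sim.max(dim=1).values.mean()     # each reference token's best candidate match
precision = sim.max(dim=0).values.mean()  # each candidate token's best reference match
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```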
Furthermore, BERTScore supports “importance weighting,” which scales each token’s contribution by its inverse document frequency (idf), so that rare, informative words shared between the sentences carry more weight than common function words.
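A minimal sketch of the idea, reusing the toy similarity matrix from above; the idf weights here are hypothetical, and the open-source bert-score package exposes a similar idf option:

```python
import torch

sim = torch.tensor([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
])
# Hypothetical idf weights for the two reference tokens (the rarer word gets a larger weight).
idf = torch.tensor([0.5, 2.0])

# Idf-weighted recall: rare reference tokens contribute more to the average.
weighted_recall = (idf * sim.max(dim=1).values).sum() / idf.sum()
print(f"idf-weighted recall: {weighted_recall:.3f}")
```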