April 8, 2022 • 4 min read

Summarization with Transformers: Setting up for Success

Written by Achille Huet


Transformer models are known to have the best performance when it comes to complex language tasks such as summarizing texts. Like humans, these models are capable of paraphrasing complicated sentences into short phrases which capture the original text’s main ideas and meaning.

Furthermore, Hugging Face’s Python library makes these models quick and easy to implement and use for diverse projects. However, I’ve found that it’s not always easy to find the best model for your use case, and depending on your dataset, these models may not give the best results right out of the box.

There are several things you can do to make sure you start with the model that is most relevant to you. And in cases where performance is still not sufficient, there are some additional tweaks you can make to find the best setup for your use case.

In this article, I’ll show how you can quickly find the best setup for your summarization project.

What are Transformers?

Transformers are the building blocks for powerful, high-performance language and time-series models. They were introduced in 2017 in the paper Attention Is All You Need, and used in 2018 to build BERT, one of the first widely adopted Transformer-based language models.

Transformer models have since become the reference in NLP, as their complex structure and attention mechanism allow them to model how spoken and written language works.

Summarization models

The Hugging Face library offers many different models to choose from, for handling different languages and use cases.

For summarizing English text, two of the most widely used models are BART and PEGASUS. They have similar architectures, but they were trained with very different methods, so each has its own strengths and weaknesses: BART is excellent at extracting the best keywords, while PEGASUS produces more fluid, natural-sounding sentences.

There are also a few variations of these models, which may be better suited to other use cases. For generating summaries in other languages, you can use, for example, BARThez for French, BARTpho for Vietnamese, or mBART-50 for other languages.

Getting a good pre-trained model

One of the common mistakes I see when using Transformer models is the tendency to use the default pretrained weights given in the Hugging Face documentation. In the case of BART, this is the facebook/bart-large-cnn model, which has been fine-tuned on the cnn-dailymail dataset. This means that the model has been trained to write summaries of news articles, so it probably won’t perform as well on other tasks like email summarization.

Identifying the best pre-trained model for your use case may increase your performance by several points, and save you many hours of tweaking and fine-tuning to get the results you want.

This is why I recommend that you start with these 2 simple steps to quickly find a good pre-trained transformers model:

1 - Find existing datasets that best represent the real-life data you are working with

If you are trying to summarize emails, then working with a model pretrained on news articles may not give the best results. This is because the sentence structure and keywords are very different between these two formats. Instead, you’ll want to work with a model trained or fine-tuned on a dataset of emails such as aeslc.

Hugging Face has provided access to a large number of datasets, which you can filter to quickly identify a list of datasets designed for summarization.

The list of summarization datasets available on Hugging Face

For news article summarization, there are many available datasets (cnn-dailymail, xsum, gigaword, and others), and each has its own characteristics. For example, cnn-dailymail provides a long summary for each article, around 4 to 5 sentences, while xsum provides very short summaries.

This means that models pretrained on cnn-dailymail will be better at producing longer summaries, and models pretrained on xsum will be better at short, one-sentence summaries. This illustrates why it is so important to select a pretrained model that matches your needs.
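One quick way to check whether a dataset's summaries match the length you are after is to profile a sample of its reference summaries. The sketch below is only an illustration: the helper name summary_profile is hypothetical, the sentence splitting is a rough regular-expression heuristic, and the two example lists are made-up stand-ins for real data.

```python
import re
from statistics import mean


def summary_profile(summaries):
    """Return the average word and sentence counts of a list of summaries."""
    word_counts = [len(s.split()) for s in summaries]
    # Rough heuristic: a sentence ends with ., ! or ? followed by space or end of text.
    sentence_counts = [
        len([part for part in re.split(r"[.!?]+(?:\s+|$)", s) if part.strip()])
        for s in summaries
    ]
    return {"avg_words": mean(word_counts), "avg_sentences": mean(sentence_counts)}


# Made-up examples standing in for real reference summaries.
short_style = ["A new bridge has opened in the city centre."]
long_style = [
    "The council approved the budget. Taxes will rise next year. "
    "Schools receive extra funding. Critics call the plan rushed."
]

print(summary_profile(short_style))
print(summary_profile(long_style))
```

With the datasets library, the input lists could come from the reference-summary column of each dataset, for example the summary column of xsum or the highlights column of cnn_dailymail.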

2 - Find model weights that have been fine-tuned on your chosen datasets

Hugging Face provides a list of available pretrained model weights, which you can filter to keep only the models you want - for example, models trained for summarization.

We can also filter this list to see the models which have been fine-tuned on a specific dataset.

Most models will have metrics for their performance on the chosen dataset, so it is usually quite simple to compare them. However, keep in mind that the given metrics may not reflect the real performance of these models on your data.

Performance of sshleifer distilbart models compared to the facebook/bart-large model
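The same filtering can also be done from Python. The sketch below assumes the huggingface_hub package is installed and that models are tagged on the Hub with the dataset:<name> convention; the helper names summarization_filter and find_candidate_models are hypothetical.

```python
def summarization_filter(dataset_name):
    """Build the Hub tags for summarization models trained on `dataset_name`."""
    return ["summarization", f"dataset:{dataset_name}"]


def find_candidate_models(dataset_name, limit=10):
    """Return the ids of up to `limit` Hub models matching the tags above."""
    # Imported here so that defining the helpers does not require the library.
    from huggingface_hub import HfApi

    api = HfApi()
    models = api.list_models(filter=summarization_filter(dataset_name), limit=limit)
    # Assumption: a recent huggingface_hub version, where each result exposes `.id`.
    return [model.id for model in models]
```

Calling find_candidate_models("xsum") queries the Hub over the network and returns a list of model ids, which you can then compare using the metrics shown on each model page.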

Generating a first summary

For my use case, I want to find a catchy sentence to explain the goal of this article, so I’ve extracted the first part of this article to summarize it.

I want to generate a short summary, so I’ll use a model fine-tuned on the xsum dataset, which contains summaries similar to what I want. After searching through the models, I’ve selected sshleifer/distilbart-xsum-6-6, which performs well.

Let’s load our pre-trained model:
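A minimal sketch of this step, assuming the transformers library is installed with a PyTorch backend. The helper name load_summarizer and the choice of the Auto classes are assumptions made for illustration.

```python
MODEL_NAME = "sshleifer/distilbart-xsum-6-6"


def load_summarizer(model_name=MODEL_NAME):
    """Load the tokenizer and seq2seq model for `model_name` (Hub or local cache)."""
    # Imported lazily because transformers is a heavy import.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model
```

A typical call would be tokenizer, model = load_summarizer().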

This will download the model if you don’t already have it stored in your cache.

Next, we encode our text and generate the corresponding summary:
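One possible way to write this step is sketched below. It assumes a tokenizer and model loaded from the checkpoint selected earlier; the function name generate_summary_ids is hypothetical, and truncation=True is an added safeguard for long inputs rather than something the article prescribes.

```python
def generate_summary_ids(text, tokenizer, model):
    """Tokenize `text` and return the generated summary as a batch of token ids."""
    # return_tensors="pt" yields PyTorch tensors; truncation=True cuts inputs
    # that exceed the model's maximum input length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )
```

The result has one row per input text, and each row is a sequence of integer token ids.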

Note: if you use TensorFlow, you can generate TensorFlow tensors instead of PyTorch tensors using return_tensors="tf"

The generated sequence is still a list of token IDs, so we need to decode it to get the text output:

This summary is already quite good, but I’m not completely convinced by the wording. Let’s try to improve it!
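Decoding could look like the sketch below, assuming the same tokenizer that was used for encoding. The function name decode_summary is hypothetical.

```python
def decode_summary(summary_ids, tokenizer):
    """Turn the first generated sequence of token ids back into plain text."""
    # skip_special_tokens=True drops markers such as start/end-of-sequence tokens.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Only the first sequence is decoded here, since a single text was summarized.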

Tweaking your model

To get better results from a summarization model, there are several parameters you can adjust:

  • min_length: the minimum number of tokens that an output text can have. Punctuation counts as a token, and some words may be made up of more than one token, so this should be slightly more than the number of words you want
  • max_length: the maximum number of tokens that an output text can have
  • num_beams: the number of candidate sequences kept at each generation step (see beam search for more details). This increases computation time but usually also improves the quality of the generated text.
  • top_k: only the top_k most probable tokens are considered at each generation step. This keeps very improbable words from popping up during text generation. Note that it only has an effect when sampling is enabled (do_sample=True).
  • no_repeat_ngram_size: prevents n-grams (sequences of n consecutive tokens) from being repeated. This is useful when producing longer texts, as models sometimes repeat themselves: in this case I suggest using a value of 3 or 4 to ensure diversity without hurting performance.
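Since min_length and max_length are counted in tokens rather than words, it can help to compare the two on a sample sentence before picking values. A small sketch, assuming a Hugging Face tokenizer is already loaded; the helper name words_vs_tokens is hypothetical.

```python
def words_vs_tokens(text, tokenizer):
    """Return (number of whitespace-separated words, number of tokens) for `text`."""
    words = text.split()
    # tokenize() returns the token strings, without special start/end tokens.
    tokens = tokenizer.tokenize(text)
    return len(words), len(tokens)
```

If the token count comes out noticeably higher than the word count on your own texts, scale min_length and max_length up accordingly.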

For my use case, I wanted to ensure that I had a short text with high quality, so I used the following parameters:

The result is amazing! I’ve even used this generated summary as a caption for this article.
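The sketch below shows how such parameters can be passed to model.generate(). The values are illustrative only and are not the settings used for this article; the names GENERATION_KWARGS and summarize are hypothetical, and a tokenizer and model are assumed to be loaded already.

```python
# Illustrative values for a short, high-quality summary (not the article's own).
GENERATION_KWARGS = {
    "min_length": 10,           # lower bound, in tokens
    "max_length": 30,           # upper bound, in tokens
    "num_beams": 8,             # wider beam search: slower, usually better text
    "no_repeat_ngram_size": 3,  # never repeat the same 3-gram
}


def summarize(text, tokenizer, model, **overrides):
    """Generate a summary of `text`, letting callers override the defaults above."""
    kwargs = {**GENERATION_KWARGS, **overrides}
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        **kwargs,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

For example, summarize(text, tokenizer, model, num_beams=4) keeps the other defaults and only narrows the beam, which is a convenient way to compare settings one at a time.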


Identifying the ideal model for your task is essential in machine learning, and this also applies to Transformer models. The Hugging Face library provides access to hundreds of models, but picking the most relevant one is a difficult task in itself.

These models can also be tweaked to fit your specific needs, and although I’ve covered some of the most important parameters, the model.generate() function still has many others that you can play around with.

If you’re looking for more resources to learn about Transformers, head over to the Hugging Face website, which has great tutorials for data scientists!
