8 min read
AlphaSense AI Research: Aspect-based summarization and WikiAsp
June 23, 2021
Every day, knowledge workers receive a deluge of information that is critical to their work. Company-issued documents, market research, and news all contain insights required to make the best possible decisions. But given the sheer amount of content, it isn't realistic to read everything, and key information may be missed.
However, recent advances in Natural Language Processing (NLP) and Deep Learning (DL) research allow for automated extraction and summarization of information from documents, and can help humans to process overwhelming information in a more efficient way. It’s part of the AI technology that powers AlphaSense (for example, the AlphaSense Language Model), and we’re constantly looking for ways to improve the technology to empower our users even further.
The AlphaSense AI research team collaborated with Carnegie Mellon University to explore novel text summarization methods, to understand current challenges, and apply what we learned to next-generation algorithms to more effectively summarize financial documents.
In this research, we focus on aspect-based, multi-document summarization to generate summaries organized by specific aspects of interest across multiple domains. Such summaries can make text analysis more efficient, for example by letting readers quickly understand reviews or opinions from different angles and perspectives.
In AlphaSense, imagine a user searching for ‘737 max’. At the peak of the crisis at Boeing, there were thousands of new and relevant documents flowing through the platform every week. A useful summarizer might extract key points from those documents and organize them based on terms (designated as aspects for the purposes of our research and this article) like:
- impact to the business
- regulatory risk
- potential litigation
- reaction of airlines
- CEO commentary
As a result of our research, we have released a multi-domain aspect-based summarization dataset, WikiAsp, to advance the research in this direction. The baseline evaluation and insights gained on the WikiAsp dataset are summarized in the recently published paper “WikiAsp: A Dataset for Multi-domain Aspect-based Summarization”, which appears in the Transactions of the Association for Computational Linguistics, a publication by MIT Press.
Aspect-based summarization is a summarization task that aims to produce targeted summaries of a document from different perspectives, where the aspects to consider depend on the document's domain. For example, a TV product review has several aspects worth summarizing separately: image, sound, connectivity, price, and more (see examples in the OpoSum dataset). The relevant aspects vary across domains, yet previous models have tended to be domain-specific. This research aims to develop a generic method to summarize documents from varying domains and with multiple aspects.
For this purpose, we built an open-source dataset from Wikipedia articles called WikiAsp (short for WikiAspect). Wikipedia is an easily accessible source of multi-domain, multi-aspect summary content, since it includes articles across many domains. Its sectional structure, i.e. the section boundaries and titles in each article, forms natural annotations of aspects and their corresponding text. Wikipedia requires that an article's content be verifiable from a set of references, so the citations should contain the majority of the information in the article. In WikiAsp, we use the cited references of a Wikipedia article as input context and construct sets of "aspects" from its section titles through steps of automatic extraction, curation, and filtering. The section texts then serve as the corresponding aspect-based summaries.
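To make the construction concrete, here is a minimal sketch of how (aspect, summary, source) instances could be derived from a parsed article's section structure. The article sections and the curated aspect vocabulary below are hypothetical stand-ins, not the paper's actual extraction pipeline.

```python
# Sketch: turn a parsed Wikipedia article into WikiAsp-style instances.
# Section bodies become target summaries; cited references become the input.

def build_instances(article_sections, cited_text, aspect_vocab):
    """Map each section title to a canonical aspect; drop titles that
    fall outside the curated aspect set (the filtering step)."""
    instances = []
    for title, section_text in article_sections:
        aspect = aspect_vocab.get(title.lower())
        if aspect is None:
            continue  # title not in the domain's curated aspect classes
        instances.append({
            "aspect": aspect,
            "summary": section_text,  # section body = aspect-based summary
            "source": cited_text,     # cited references = input context
        })
    return instances

# Hypothetical article with one out-of-vocabulary section ("Trivia").
sections = [("History", "Founded in 1916 ..."),
            ("Products", "Commercial airliners ..."),
            ("Trivia", "Miscellaneous facts ...")]
vocab = {"history": "history", "products": "products"}
out = build_instances(sections, "Text of the cited references ...", vocab)
```

The filtering step mirrors the paper's curation: only section titles that map into a domain's pre-defined aspect classes yield training instances.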
The WikiAsp dataset consists of instances in 20 domains, where each domain has 10 pre-defined aspect classes. This makes it significantly larger than previously collected aspect-based summarization datasets, as shown in the table below: previous datasets covered between 1 and 7 domains, fewer than half of WikiAsp's 20. The input documents are also significantly longer, at 13,672 words compared to 2,369 words for the next largest dataset.
After collecting the WikiAsp dataset, we wanted to gain more insights into aspect-based summarization, to identify major challenges and potential solutions. For this purpose, we experiment with baseline methods on WikiAsp. We devised a two-stage approach: aspect discovery and aspect-based summarization, as shown in the below figure.
The first stage is aspect discovery, which classifies sentences in the cited references into aspects. We use a fine-tuned RoBERTa model for aspect discovery; it predicts not only the aspect of each sentence for a given domain, but also whether the sentence is relevant in the first place.
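The aspect-discovery stage can be pictured as follows. The paper uses a fine-tuned RoBERTa classifier; in this self-contained sketch a simple keyword lookup stands in for the learned model, and the aspect set is a hypothetical example, but the shape of the stage is the same: each sentence is assigned an aspect or filtered out as irrelevant.

```python
# Sketch of aspect discovery: label each sentence from the cited references
# with an aspect, or drop it as irrelevant (the "general relevance" step).
# A keyword lookup stands in for the fine-tuned RoBERTa classifier.

ASPECT_KEYWORDS = {  # hypothetical aspect classes for illustration
    "regulatory risk": {"faa", "regulator", "grounding"},
    "ceo commentary": {"ceo", "statement", "executive"},
}

def discover_aspects(sentences):
    """Group sentences by predicted aspect; sentences matching no aspect
    are treated as irrelevant and filtered out."""
    grouped = {aspect: [] for aspect in ASPECT_KEYWORDS}
    for sent in sentences:
        tokens = set(sent.lower().split())
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if tokens & keywords:
                grouped[aspect].append(sent)
                break
    return grouped

docs = ["The FAA ordered a grounding of the fleet.",
        "The CEO issued a statement on Friday.",
        "Shares were flat in afternoon trading."]
groups = discover_aspects(docs)
```

The grouped output of this stage is exactly what the second stage consumes: one bag of sentences per aspect, ready to be summarized.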
The second stage is aspect-based summarization, of which there are two types: extractive summarization, which selects important content from the original text as the summary, and abstractive summarization, which generates new text as the summary. We test both an extractive method (TextRank) and an abstractive method (BERTSum) on the labeled and grouped sentences to perform supervised aspect-based summarization.
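The extractive side of this stage can be sketched compactly. The following is a simplified TextRank-style ranker — a PageRank-like iteration over a word-overlap similarity graph — not the exact implementation evaluated in the paper, but it illustrates how the top sentences within an aspect group are selected.

```python
# Simplified TextRank-style extractive summarization: rank sentences by a
# PageRank-like score over a word-overlap similarity graph, keep the top-k.
import math

def similarity(s1, s2):
    """Word-overlap similarity, length-normalized as in the TextRank paper."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, top_k=1, damping=0.85, iters=30):
    n = len(sentences)
    sim = [[similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)] for i, a in enumerate(sentences)]
    scores = [1.0] * n
    for _ in range(iters):  # power iteration over the similarity graph
        new = []
        for i in range(n):
            rank = sum(sim[j][i] / (sum(sim[j]) or 1.0) * scores[j]
                       for j in range(n) if j != i)
            new.append((1 - damping) + damping * rank)
        scores = new
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_k]]

sents = ["the plane was grounded after the incident",
         "regulators grounded the plane worldwide",
         "bananas are yellow"]
top = textrank(sents, top_k=1)
```

Sentences well connected to the rest of the group accumulate score, while off-topic sentences (here, the third one) stay near the damping floor and are never selected.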
Through numerous experiments, we performed an extensive analysis of performance across different genres and aspect types, revealing the unique challenges in the multi-domain and multi-document setting.
First, the training samples are imbalanced across aspects, so the aspect classifier has high recall but low precision. This means we need to carefully balance training across aspects. Second, we use the ROUGE score, which measures the overlap between generated summaries and the ground truth, to evaluate summarization quality. We find that both the abstractive and extractive baselines achieve low ROUGE scores (see the Experiments section of the paper). This shows that multi-domain, aspect-based summarization is indeed a challenging task that requires more advanced algorithms.
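For readers unfamiliar with the metric, ROUGE-1 can be computed in a few lines. This is a minimal unigram-overlap sketch (no stemming or bootstrapping) that illustrates the idea rather than reproducing the official scorer:

```python
# Minimal ROUGE-1 sketch: unigram overlap between a candidate summary and a
# reference, reported as precision, recall, and F1.
from collections import Counter

def rouge1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matched unigram count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1("the plane was grounded",
                 "the plane was grounded worldwide")
```

Here every candidate word appears in the reference (precision 1.0), but the reference word "worldwide" is missed (recall 0.8) — the kind of partial overlap that dominates the low baseline scores we observed.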
We also observe that aspects requiring content to be summarized in a particular temporal order (e.g. time-series events) add extra difficulty, because of the need to correctly order scattered and possibly duplicated pieces of information from different sources. Domains that involve interviews or quotes from people also pose challenges in correctly rewriting pronouns based on their relationship to the topic of interest.
Through this research, we built WikiAsp, a large-scale, multi-domain, multi-aspect summarization dataset derived from Wikipedia. WikiAsp invites a focus on multiple domains of interest for investigating open problems in text summarization, and provides a testbed to develop and evaluate new summarization algorithms. Through our initial benchmarking, we identified challenges to address in future work.
As the next step, we plan to apply this two-stage method to summarize financial documents, where it is common to see multiple domains and multiple aspects. Although the fully automated algorithmic summarization at a domain expert level still has a long way to go, we can leverage some of the developments from this work. For example, we plan to use abstractive summarization methods to highlight important aspects of a company and associated sentences in an earnings call, so readers can quickly understand what is happening with the company without having to read an entire earnings call transcript.