Why (and how) we created our unique theme extraction algorithms

Topic extraction from text-based documents holds the possibility of massive time savings, but it’s difficult to do with high accuracy. What’s AlphaSense’s approach to topic extraction? Why is it more accurate than other tools? Read on to find out.


Every earnings season, AlphaSense users, some of the smartest and busiest analysts and strategists in the world, are expected to have granular and current knowledge about dozens of companies. Historically, this has meant many tedious hours spent identifying the nuances revealed in the latest earnings calls and determining how they are going to impact each company’s strategy.

At AlphaSense, we understand that our clients are looking to get ahead of the market, and that time is therefore of the essence. Extracting the most important topics from a transcript, and presenting them alongside key metrics such as overall mentions, quarter-over-quarter increase in mentions, and positive/negative sentiment, is a powerful way to highlight the ‘aboutness’ of that transcript without reading it line by line. This offers users massive time savings while surfacing the most relevant information at scale across every competitor’s transcripts.
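To make the mention-based metrics concrete, here is a minimal sketch of how overall mentions and quarter-over-quarter change could be computed for a list of topics that has already been extracted. The function names `count_mentions` and `qoq_change` are hypothetical and chosen for illustration; this is plain phrase counting, not AlphaSense’s production pipeline.

```python
import re
from collections import Counter
from typing import Dict, List, Optional


def count_mentions(transcript: str, topics: List[str]) -> Counter:
    """Count case-insensitive, whole-phrase mentions of each topic."""
    text = transcript.lower()
    counts: Counter = Counter()
    for topic in topics:
        # \b keeps "pricing" from matching inside a longer word.
        pattern = r"\b" + re.escape(topic.lower()) + r"\b"
        counts[topic] = len(re.findall(pattern, text))
    return counts


def qoq_change(current: Counter, previous: Counter) -> Dict[str, Optional[float]]:
    """Relative change in mentions versus the prior quarter.

    Returns None for a topic with no mentions in the prior quarter,
    since a relative change is undefined in that case.
    """
    changes: Dict[str, Optional[float]] = {}
    for topic, now in current.items():
        before = previous.get(topic, 0)
        changes[topic] = None if before == 0 else (now - before) / before
    return changes


if __name__ == "__main__":
    topics = ["supply chain", "pricing"]
    last_quarter = "Supply chain issues persisted. Pricing was stable."
    this_quarter = (
        "Supply chain pressure eased, but supply chain costs rose. "
        "Pricing improved. Supply chain remains a focus."
    )
    current = count_mentions(this_quarter, topics)
    previous = count_mentions(last_quarter, topics)
    print(current)
    print(qoq_change(current, previous))
```

A topic whose mentions jump sharply from one quarter to the next is a natural candidate to surface first, even when its absolute count is still small.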

Though topic extraction is not new, applying it to transcripts requires a nuanced understanding of financial language. Earnings calls are often convoluted and full of boilerplate that carries no useful information, so generic open source models tend to return irrelevant results. Different domains also have their own specialized jargon, which a generic model is not trained to recognize. In addition, many open source models limit their output to 50–100 topics, which is not enough to capture all the relevant information in transcripts that can run to 100+ pages.

Luckily, at AlphaSense, we have already spent years investing in the world’s highest accuracy sentiment analysis model for financial language. Building upon our existing body of AI expertise, we’ve developed a proprietary tool called Document Themes to help our clients quickly excavate signals buried in documents like transcripts.

Document Themes is the culmination of many high-impact AI initiatives that AlphaSense has spearheaded:

  • Sentiment Analysis. Deep-learning based sentiment analysis identifies, quantifies, and analyzes language in text-based documents as positive, negative or neutral with over 90% accuracy to help investors and business professionals catch the subtle inflection points in language that move markets.
  • Dynamic Natural Language Processing (NLP). AlphaSense’s NLP-driven theme extraction uses algorithms that understand financial language and historical context to intelligently capture the full discussion around each theme. Advanced exclusions ensure generic and obvious themes are removed from results.
  • Custom Multi-Level Ranking Algorithm. This combines graph-based topic ranking with custom features to list a document’s topics in order of their relevance to the original document, their uniqueness, and their importance to end users.
  • Topic clustering. We use our proprietary smart synonym technology and phrase level word embeddings trained internally to cluster similar topics together. This reduces redundancy, allows users to see a diverse set of the most important topics discussed in a transcript, and also gives a more accurate representation of the topic’s counts and other features, which ultimately improves the importance ranking of topics.
  • The value-add of clustering to topic extraction is easy to see. In a standard setup, the topics `coronavirus` and `covid-19` would appear as two separate items in the topic list. By using our smart synonyms together with the knowledge gained from phrase-level embeddings, we are able to cluster these synonymous concepts into a single topic.
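The clustering step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not AlphaSense’s proprietary implementation: the `SYNONYMS` table, the three-dimensional `EMBEDDINGS` vectors, and the 0.95 similarity threshold are all made-up values, and the merge uses a simple union-find. Two topics are grouped when they appear together in the synonym table or when the cosine similarity of their embeddings clears the threshold, and their mention counts are then summed.

```python
import math
from collections import defaultdict
from typing import Dict, List

# Hypothetical synonym table and toy embeddings (illustrative values only).
SYNONYMS = {frozenset({"coronavirus", "covid-19"})}
EMBEDDINGS: Dict[str, List[float]] = {
    "price increases": [0.90, 0.10, 0.00],
    "pricing actions": [0.88, 0.12, 0.02],
    "cloud revenue": [0.00, 0.20, 0.95],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def are_similar(a: str, b: str, threshold: float = 0.95) -> bool:
    """True if the pair is a known synonym or close in embedding space."""
    if frozenset({a, b}) in SYNONYMS:
        return True
    if a in EMBEDDINGS and b in EMBEDDINGS:
        return cosine(EMBEDDINGS[a], EMBEDDINGS[b]) >= threshold
    return False


def cluster_topics(counts: Dict[str, int], threshold: float = 0.95) -> Dict[str, dict]:
    """Merge similar topics and sum their mention counts."""
    topics = list(counts)
    parent = {t: t for t in topics}

    def find(t: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(topics):
        for b in topics[i + 1:]:
            if are_similar(a, b, threshold):
                parent[find(b)] = find(a)

    merged: Dict[str, dict] = defaultdict(lambda: {"members": [], "count": 0})
    for t in topics:
        root = find(t)
        merged[root]["members"].append(t)
        merged[root]["count"] += counts[t]
    return dict(merged)


if __name__ == "__main__":
    raw_counts = {
        "coronavirus": 12,
        "covid-19": 30,
        "price increases": 5,
        "pricing actions": 3,
        "cloud revenue": 7,
    }
    for label, info in cluster_topics(raw_counts).items():
        print(label, info)
```

Because the merged cluster carries the combined count, a concept split across several surface forms is no longer under-counted when topics are later ranked by importance.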

To get started with Document Themes,

Tanvi Sahay is an Artificial Intelligence Researcher at AlphaSense Inc. A graduate of UMass Amherst, she focuses on creating interpretable, customer-focused solutions to NLU and Information Extraction problems while listening to anime soundtracks.

