Extracting Pre-Defined Themes By Processing Data Using Classification Models

Contributed By: SAURABH SETHI

BACKGROUND: I am Saurabh Sethi, and I have 11+ years of experience across a variety of fields. An early bloomer, I began my career in call centers, where I learned to communicate efficiently, convert sales, and build relationships. On the strength of that team-building and communication experience, I was hired by Genpact to support payment collection for a healthcare provider, where I was exposed to a wide range of metrics and developed my interest in data. A couple of internal moves later, I had the chance to experiment with provider and billing data and to participate in a Master Data Management project. I then joined ATCS Inc. to pilot social media analytics and listening for global brands, which drew me to the heart of analytics, and I am now awaiting my Data Science PG degree.

PROBLEM STATEMENT: To offer Social Media Listening and digital strategies, we use publicly available social media posts, collected on the basis of mentions, to uncover insights that drive strategies for our partner brands. This data is highly dispersed and qualitative, so analysts spend a significant amount of manual effort slicing and dicing thousands of contextual data points to uncover the themes and patterns on which inferences are built.

GOAL STATEMENT: Create a supervised classification model, trained on a given topic, that extracts pre-defined themes from millions of rows of data while accounting for how social media users write, including sarcasm.

TECHNIQUES USED: Using historical data on specialized themes, we built a supervised classification model with a regression-based Support Vector Machine (SVM) technique. The model was trained on contextual data that had been cleaned and tokenized with Natural Language Processing (NLP), and it was deployed in a React Native application.
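
As a rough illustration of the cleaning, tokenizing, and SVM steps described above, here is a minimal Python sketch. It is not the production model: it assumes scikit-learn, uses TF-IDF features with LinearSVC (a standard linear SVM classifier) as a stand-in for the regression-based SVM setup, and the posts, theme labels, and the clean_post helper are all invented for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled posts; the text and theme names are made up.
train_posts = [
    "Delivery took three weeks, still waiting for my order!!",
    "My package arrived late again @brand",
    "Order shipped late and tracking never updated",
    "Love the new flavour, tastes amazing https://example.com/p",
    "The taste of this drink is so good #yum",
    "Best flavour they have released so far",
]
train_themes = ["delivery", "delivery", "delivery", "taste", "taste", "taste"]


def clean_post(text: str) -> str:
    """Lowercase a post and strip URLs, @mentions and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits, punctuation, '#'
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace


# TfidfVectorizer tokenizes the cleaned text into unigrams and bigrams;
# LinearSVC then learns one linear decision boundary per theme.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_post, ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_posts, train_themes)

# Classify unseen posts into the pre-defined themes.
new_posts = ["Why is my delivery so late?", "This flavour tastes great"]
predicted = list(model.predict(new_posts))
print(predicted)
```

A real training set would need far more labelled posts per theme, plus a held-out test split to measure accuracy.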

OBSERVATIONS: Applying the techniques learned in the training, we discovered a number of anomalies and redundancies in the data, caused by a few influential social media accounts dominating the discussion. This opened up another use case around author segmentation and mapping.
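
One simple way to surface this kind of dominance and redundancy is sketched below in Python. It is only an assumption about how such a check could look, not the analysis we actually ran: the author names, posts, the 30% share threshold, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical (author, post text) pairs.
posts = [
    ("big_account", "Huge sale today"),
    ("big_account", "Huge sale today"),
    ("big_account", "huge  sale today"),
    ("big_account", "New store opening"),
    ("big_account", "Check our new range"),
    ("user_a", "Delivery was late"),
    ("user_b", "Love the taste"),
    ("user_c", "Huge sale today"),
]


def dominant_authors(posts, max_share=0.3):
    """Return authors whose share of all posts exceeds max_share."""
    counts = Counter(author for author, _ in posts)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items() if c / total > max_share}


def drop_duplicates(posts):
    """Keep the first post for each distinct text, ignoring case and spacing."""
    seen = set()
    unique = []
    for author, text in posts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((author, text))
    return unique


print(dominant_authors(posts))
print(len(drop_duplicates(posts)))
```

Authors flagged this way could be down-weighted before training, or set aside as a segment of their own for author mapping.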

SOLUTION: We successfully deployed the classification model in a frontend application that lets users categorize social media feeds into pre-defined labels, removing the time-consuming process of manually reading and segmenting the conversation. This enables the digital analyst to quantify the various talking points and dig further into the data to find the main problems and opportunity areas for the brand. The model, currently configured with SVM regression equations, achieves an accuracy of 92% and processes 1 million rows of contextual data points in about 5 minutes (roughly 3,300 rows per second).

“Automation is cost-cutting by tightening the corners and not cutting them.” – Haresh Sippy