To analyse the sentiments of a WhatsApp chat, I have collected the data from my personal WhatsApp chats. To collect the data of your own chats, open the chat in WhatsApp, tap the menu, choose More > Export chat, and export it without media; WhatsApp will produce a plain text file of the conversation.
I've started this task by defining some helper functions, because the data exported from WhatsApp is not a dataset that is ready to be used for any kind of data science task.
import re
import pandas as pd
import numpy as np
import emoji
from collections import Counter
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# Check whether a line starts with a "date, time -" stamp
def date_time(s):
    pattern = r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -'
    result = re.match(pattern, s)
    if result:
        return True
    return False
# Find Authors or Contacts: a message line looks like "Author: message"
def find_author(s):
    s = s.split(":")
    # use >= 2 so messages that themselves contain a colon are not missed
    if len(s) >= 2:
        return True
    else:
        return False
# Split a raw line into date, time, author and message
def getDatapoint(line):
    splitline = line.split(' - ')
    dateTime = splitline[0]
    date, time = dateTime.split(", ")
    message = " ".join(splitline[1:])
    if find_author(message):
        splitmessage = message.split(": ")
        author = splitmessage[0]
        message = " ".join(splitmessage[1:])
    else:
        author = None
    return date, time, author, message
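As a quick sanity check, the parser can be exercised on a single exported line. The timestamp and contact name below are made up for illustration; the helper functions are the ones defined above.

```python
def find_author(s):
    # a line of the form "Author: message" splits into at least two parts
    return len(s.split(":")) >= 2

def getDatapoint(line):
    # separate the "date, time" stamp from the rest of the line
    splitline = line.split(' - ')
    date, time = splitline[0].split(", ")
    message = " ".join(splitline[1:])
    author = None
    if find_author(message):
        splitmessage = message.split(": ")
        author = splitmessage[0]
        message = " ".join(splitmessage[1:])
    return date, time, author, message

print(getDatapoint("12/04/22, 1:25 am - Kirti: Good morning!"))
# ('12/04/22', '1:25 am', 'Kirti', 'Good morning!')
```

System messages without an author (for example WhatsApp's encryption notice) come back with author set to None, which is why the DataFrame is later cleaned with dropna().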
In this step, it doesn't matter whether you are using a group chat export or a conversation with one person; all the functions defined above will prepare the data for sentiment analysis either way.
data = []
conversation = 'WhatsApp Chat with Sapna.txt'
with open(conversation, encoding="utf-8") as fp:
    fp.readline()  # skip the export header line
    messageBuffer = []
    date, time, author = None, None, None
    while True:
        line = fp.readline()
        if not line:
            break
        line = line.strip()
        if date_time(line):
            if len(messageBuffer) > 0:
                data.append([date, time, author, ' '.join(messageBuffer)])
            messageBuffer.clear()
            date, time, author, message = getDatapoint(line)
            messageBuffer.append(message)
        else:
            messageBuffer.append(line)
    # flush the last buffered message once the file ends
    if len(messageBuffer) > 0:
        data.append([date, time, author, ' '.join(messageBuffer)])
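To see what this loop produces without a real export file, the same logic can be run over a small in-memory sample. The names, timestamps and messages below are invented for illustration only:

```python
import io
import re

def date_time(s):
    # True if the line starts with a "date, time -" stamp
    pattern = r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)[ ]?(AM|PM|am|pm)? -'
    return re.match(pattern, s) is not None

def find_author(s):
    # a line of the form "Author: message" splits into at least two parts
    return len(s.split(":")) >= 2

def getDatapoint(line):
    splitline = line.split(' - ')
    date, time = splitline[0].split(", ")
    message = " ".join(splitline[1:])
    author = None
    if find_author(message):
        splitmessage = message.split(": ")
        author = splitmessage[0]
        message = " ".join(splitmessage[1:])
    return date, time, author, message

sample = io.StringIO(
    "12/04/22, 1:00 am - Messages are end-to-end encrypted.\n"
    "12/04/22, 1:25 am - Kirti: Good morning!\n"
    "12/04/22, 1:26 am - Shivam: Morning!\n"
    "This line continues the previous message.\n"
)

data = []
sample.readline()                      # skip the header line
messageBuffer = []
date, time, author = None, None, None
while True:
    line = sample.readline()
    if not line:
        break
    line = line.strip()
    if date_time(line):
        if len(messageBuffer) > 0:
            data.append([date, time, author, ' '.join(messageBuffer)])
        messageBuffer.clear()
        date, time, author, message = getDatapoint(line)
        messageBuffer.append(message)
    else:
        messageBuffer.append(line)
if len(messageBuffer) > 0:             # flush the final message
    data.append([date, time, author, ' '.join(messageBuffer)])

for row in data:
    print(row)
```

Note how the continuation line without a timestamp is folded into the previous message rather than becoming its own row; that is what the messageBuffer is for.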
Now, here is how we can analyze the sentiments of a WhatsApp chat using Python:
df = pd.DataFrame(data, columns=["Date", 'Time', 'Author', 'Message'])
df['Date'] = pd.to_datetime(df['Date'])  # pass dayfirst=True if your export uses day/month/year
data = df.dropna().copy()  # .copy() avoids pandas' SettingWithCopyWarning below
import nltk
nltk.download('vader_lexicon')  # required once before using VADER
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentiments = SentimentIntensityAnalyzer()
data["Positive"] = [sentiments.polarity_scores(i)["pos"] for i in data["Message"]]
data["Negative"] = [sentiments.polarity_scores(i)["neg"] for i in data["Message"]]
data["Neutral"] = [sentiments.polarity_scores(i)["neu"] for i in data["Message"]]
print(data.head())
Date Time Author ... Positive Negative Neutral
0 2022-04-06 01:00 am Kirti ... 0.0 0.000 1.000
1 2022-04-06 01:02 am Kirti ... 0.0 0.000 1.000
2 2022-04-06 01:06 am Shivam ... 0.0 0.000 1.000
3 2022-04-06 01:07 am Kirti ... 0.0 0.383 0.617
4 2022-04-06 01:12 am Shivam ... 0.0 0.000 1.000
Now, let's sum up the scores across all messages and compare the totals to decide whether the overall tone of the chat is positive, negative, or neutral:
x = sum(data["Positive"])
y = sum(data["Negative"])
z = sum(data["Neutral"])

def sentiment_score(a, b, c):
    if (a > b) and (a > c):
        print("Positive 😊")
    elif (b > a) and (b > c):
        print("Negative 😠")
    else:
        print("Neutral 🙂")

sentiment_score(x, y, z)
Output:
Positive 😊
So the data I used indicates that most of the messages between Kirti and me are positive.
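The same comparison can also be packaged as a function that returns a label instead of printing one, which makes it easier to reuse elsewhere. The totals passed in below are made-up numbers purely for illustration:

```python
def sentiment_label(pos, neg, neu):
    # return whichever aggregate score dominates
    if pos > neg and pos > neu:
        return "Positive"
    elif neg > pos and neg > neu:
        return "Negative"
    return "Neutral"

print(sentiment_label(120.5, 30.2, 410.8))
# Neutral
```

Returning a string rather than printing means the label can be stored in a variable, written to a report, or used to compare several chats at once.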