Sent
Eurostar is a high-speed railway service that connects London with major European cities, e.g. Amsterdam, Brussels, Lyon, Paris and Rotterdam. All the trains traverse the Channel Tunnel between the United Kingdom and France.
The company collects feedback from its passengers through multiple channels, but one of the most popular is Twitter. Although it is easy for travelers to express their dissatisfaction on social media, it is hard to address their issues and infer meaningful insights. The current process of addressing the passengers' feedback consumes a lot of time since it is done manually by a Eurostar employee.
A range of major passenger problems has been successfully identified. In automatically created groups of tweets the following issues could be easily recognised:
• Internet connection/wifi related complaints,
• compensation and refund queries,
• online booking error reports,
• immediate help requests,
• and direct message (DM) inquiries.
[Figure: example tweet]
Paweł Mielniczuk (@TheRealMielniczukPawel), 5:11 PM, 04 Dec 18
@Eurostar Third time this month riding your train from Paris to London and again no WiFi!
#Eurostar #angry #dissapointed #noWifi
13 retweets, 15 likes
The main goal of this project was to automatically identify the most common types of passenger problems sent in through Twitter messages and to filter out SPAM and unimportant issues.
The data used in the project was collected directly from Twitter using its public API in November 2018. The fetched tweets consist not only of messages sent in by Eurostar passengers but also of Eurostar employee responses. Even though many of them express gratitude and happiness with their travels, the vast majority of messages are considered to be either complaints or direct questions regarding Eurostar train services. Over 290 thousand tweets were collected in total.
[Figure: chart with the hours of the day (0 to 23) on the horizontal axis]
Dozens of SPAM messages were also filtered out, as well as a good number of tweets containing words of thanks. There were plenty of tweets with an @Eurostar mention that included neither relevant information nor requests. Uploaded photos of happy passengers during their travels were not taken into account either, and such messages were simply ignored.
Methods & Tools
The final solution was implemented in Python 3 using, among others, scikit-learn, pandas, spaCy, textacy, NumPy and NLTK.
English tweets
Since Eurostar supports multiple languages, the English tweets had to be extracted.
Data collected from the @Eurostar Twitter account over a period of 6 years.
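The poster does not say how the English tweets were selected. One possible approach, sketched below as an assumption, is to rely on the `lang` field that the Twitter API attaches to each tweet object. The sample records and the `keep_english` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical sample records; real tweets would come from the Twitter API,
# where each tweet object carries a machine-detected `lang` code.
tweets = [
    {"id": 1, "lang": "en", "text": "@Eurostar no WiFi again on the train to London"},
    {"id": 2, "lang": "fr", "text": "@Eurostar mon train est en retard"},
    {"id": 3, "lang": "nl", "text": "@Eurostar is er wifi aan boord?"},
    {"id": 4, "lang": "en", "text": "@Eurostar how do I claim a refund?"},
]


def keep_english(records):
    """Return only the records whose `lang` field equals 'en'."""
    return [r for r in records if r.get("lang") == "en"]


english_tweets = keep_english(tweets)
print([t["id"] for t in english_tweets])
```

Records without a `lang` field are dropped by this sketch; a separate language detector would be needed to recover them.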
Takeaways
Clustering
GloVe trained on Twitter data was used to convert text to vectors. The following clustering methods were used: K-Means, DBSCAN and Agglomerative Clustering.
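How the word vectors were combined into one vector per tweet is not stated on the poster. A common choice, assumed here purely for illustration, is to average the vectors of the known words in a tweet. The toy 3-dimensional vectors below stand in for real GloVe Twitter embeddings, and `tweet_vector` and `cosine` are hypothetical helper names.

```python
import math

# Toy 3-dimensional vectors standing in for real GloVe Twitter embeddings
# (an assumption for illustration; real GloVe vectors have many more dimensions).
toy_vectors = {
    "wifi": [0.9, 0.1, 0.0],
    "internet": [0.8, 0.2, 0.0],
    "refund": [0.0, 0.1, 0.9],
    "compensation": [0.1, 0.0, 0.8],
}


def tweet_vector(tokens, vectors, dim=3):
    """Average the vectors of all known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


v_wifi = tweet_vector(["no", "wifi", "internet"], toy_vectors)
v_wifi_again = tweet_vector(["wifi", "again"], toy_vectors)
v_refund = tweet_vector(["refund", "compensation", "please"], toy_vectors)

# Tweets about the same topic should end up closer to each other.
print(cosine(v_wifi, v_wifi_again) > cosine(v_wifi, v_refund))
```

Vectors built this way can then be passed to a clustering algorithm, which groups tweets whose vectors lie close together.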
Data cleaning
Replacements of emojis, numbers, timestamps, train numbers, URLs and mentions are only a few of the preprocessing methods that were used to clean the raw data. Cities and train station names used by Eurostar were also handled.
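The replacement steps listed above could look roughly like the sketch below. The placeholder tokens, the regular expressions and the `clean_tweet` helper are all assumptions made for illustration; in particular, the train number pattern (an optional "ES" prefix followed by a four-digit number starting with 9) is a guess and not taken from the project.

```python
import re

# Each pattern is replaced with a placeholder token. Order matters:
# URLs, train numbers and timestamps are handled before bare numbers.
REPLACEMENTS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"@\w+"), "<mention>"),
    # Assumed train number format, e.g. "9031" or "ES9031".
    (re.compile(r"\b(?:ES)?9\d{3}\b", re.IGNORECASE), "<train>"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "<time>"),
    (re.compile(r"\b\d+\b"), "<number>"),
    # Rough emoji coverage: common pictograph and symbol ranges only.
    (re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"), "<emoji>"),
]


def clean_tweet(text):
    """Apply every replacement in order, then collapse repeated whitespace."""
    for pattern, token in REPLACEMENTS:
        text = pattern.sub(token, text)
    return re.sub(r"\s+", " ", text).strip()


raw = "@Eurostar train 9031 at 17:11 delayed 40 minutes \U0001F621 https://t.co/abc"
print(clean_tweet(raw))
```

Replacing these items with shared tokens, rather than deleting them, keeps the information that a tweet mentioned a train or a time while removing the specific values that would otherwise fragment the vocabulary.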
• Planning the data cleaning process ahead of time is difficult because it is highly contextual, and it may require more iterations than assumed.
• Twitter data contains a lot of clutter. Some tweets do not include any relevant information, hence many of them cannot be addressed automatically.
• There is no straightforward way to evaluate clusters automatically.
• The DBSCAN algorithm is not well suited to clustering data with varying density, such as Twitter messages, but it may perform really well at filtering out noise.
• Agglomerative clustering gives much better results than K-Means or DBSCAN, since it can analyse every cluster more deeply, yielding more detailed categories.
The clustering algorithms were run on a set of tweets containing passengers' questions as well as on direct Eurostar responses. The clusters obtained with both approaches were later compared, and unfortunately no significant improvements were noticed in clustering by answers over clustering by questions.
Scan this QR code for an online version of this poster: