Discovering Emerging Topics in Social Streams via Link-Anomaly Detection
ABSTRACT
Detection of emerging topics is receiving renewed interest, motivated by the
rapid growth of social networks. Conventional term-frequency-based approaches
may not be appropriate in this context, because the information exchanged in
social-network posts includes not only text but also images, URLs, and videos.
We focus on the emergence of topics signaled by the social aspects of these
networks. Specifically, we focus on mentions of users, that is, links between
users that are generated dynamically (intentionally or unintentionally) through
replies, mentions, and retweets. We propose a probability model of the
mentioning behavior of a social network user, and propose to detect the
emergence of a new topic from the anomalies measured through the model. By
aggregating anomaly scores from hundreds of users, we show that we can detect
emerging topics based only on the reply/mention relationships in social-network
posts. We demonstrate our technique on several real data sets gathered from
Twitter. The experiments show that the proposed mention-anomaly-based
approaches can detect new topics at least as early as text-anomaly-based
approaches, and in some cases much earlier when the topic is poorly identified
by the textual content of the posts.
Keywords: Topic detection, anomaly detection, social networks, sequentially
discounted normalized maximum-likelihood coding, burst detection
INTRODUCTION
Communication over social networks, such as Facebook and Twitter, is gaining
importance in our daily lives. Since the information exchanged over social
networks includes not only text but also URLs, images, and videos, these
networks are challenging test beds for the study of data mining. In particular,
we are interested in the problem of detecting emerging topics from social
streams, which can be used to create automated “breaking news” or to discover
hidden market needs or underground political movements. Compared to
conventional media, social media are able to capture the earliest, unedited
voice of ordinary people. Therefore, the challenge is to detect the emergence
of a topic as early as possible with a moderate number of false positives.
Another difference that makes social media social is the existence of mentions.
By mentions we mean links to other users of the same social network in the form
of message-to, reply-to, retweet-of, or explicit references in the text. One
post may contain a number of mentions. Some users rarely include mentions in
their posts; others mention their friends all the time. Some users (like
celebrities) may receive mentions every minute; for others, being mentioned
might be a rare occasion. In this sense, mentions are like a language in which
the number of words equals the number of users in the social network.
Literature Survey
1. Goal of Project:
The main aims of this paper are:
- To detect emerging topics as early as the keyword-based methods.
- To evaluate whether the proposed approach can detect the emergence of the
  topics recognized and collected by people.
2. Analysis on Existing Networks:
Our basic assumption is that a new (emerging) topic is something people feel
like discussing, commenting on, or forwarding to their friends. Conventional
approaches for topic detection have mainly been concerned with the frequencies
of (textual) words.
Fig. 1 shows an example of the emergence of a topic through posts on social
networks. The first post by Bob contains mentions of Alice and John, who are
both probably friends of Bob, so there is nothing unusual here.
The second post by John is a reply to Bob, but it is also visible to many
friends of John who are not direct friends of Bob. Then in the third post,
Dave, one of John’s friends, forwards (called a retweet in Twitter) the
information further down to his own friends. Note that the topic of this
conversation is not clear from the textual information, because the users are
talking about something (a new gadget, car, or jewelry) that is shown only as a
link in the text.
Disadvantage
A term-frequency-based approach could suffer from the ambiguity caused by
synonyms or homonyms. It may also require complicated preprocessing (e.g.,
segmentation) depending on the target language. Moreover, it cannot be applied
when the contents of the messages are mostly nontextual information. On the
other hand, the “words” formed by mentions are unique, require little
preprocessing to obtain (the information is often separated from the contents),
and are available regardless of the nature of the contents.
3. Idea on Proposed System:
We propose a probability model that can capture the normal mentioning behavior
of a user, which consists of both the number of mentions per post and the
frequency of users occurring in the mentions. This model is then used to
measure the anomaly of future user behavior. Using the proposed probability
model, we can quantitatively measure the novelty or possible impact of a post
as reflected in the mentioning behavior of the user.
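The idea of scoring a post by how unlikely its mentions are can be sketched as
follows. This is only an illustration under stated assumptions: the class name
`MentionModel`, the additive smoothing constant `alpha`, and the cap
`max_mentions` are hypothetical choices made here, and the simple smoothed
frequency estimates stand in for whatever distributions the proposed model
actually uses. The score is the negative log-probability of the number of
mentions plus that of each mentioned user.

```python
import math
from collections import Counter


class MentionModel:
    """Smoothed frequency model of one user's mentioning behavior (illustrative)."""

    def __init__(self, n_users, alpha=1.0, max_mentions=10):
        self.n_users = n_users            # size of the mention "vocabulary"
        self.alpha = alpha                # additive smoothing constant (assumed)
        self.max_mentions = max_mentions  # counts above this are clipped (assumed)
        self.count_hist = Counter()       # how often a post contained k mentions
        self.user_hist = Counter()        # how often each user was mentioned
        self.n_posts = 0
        self.n_mentions = 0

    def update(self, mentions):
        """Add one observed post (a list of mentioned user names) to the history."""
        k = min(len(mentions), self.max_mentions)
        self.count_hist[k] += 1
        self.n_posts += 1
        for user in mentions:
            self.user_hist[user] += 1
            self.n_mentions += 1

    def score(self, mentions):
        """Anomaly score: negative log-probability of the post's mentions."""
        k = min(len(mentions), self.max_mentions)
        p_k = (self.count_hist[k] + self.alpha) / (
            self.n_posts + self.alpha * (self.max_mentions + 1)
        )
        total = -math.log(p_k)
        for user in mentions:
            p_user = (self.user_hist[user] + self.alpha) / (
                self.n_mentions + self.alpha * self.n_users
            )
            total -= math.log(p_user)
        return total


# Train on a user's habitual behavior, then score a familiar and an unfamiliar post.
model = MentionModel(n_users=1000)
for _ in range(50):
    model.update(["alice", "john"])
usual = model.score(["alice", "john"])
unusual = model.score(["carol", "dave", "erin"])
print(usual < unusual)
```

In this sketch a post that mentions the user's usual friends receives a low
score, while a post that mentions several previously unseen users receives a
high one.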
We aggregate the anomaly scores obtained in this way over hundreds of users and
apply a recently proposed change-point detection technique based on the
sequentially discounting normalized maximum-likelihood (SDNML) coding [3]. This
technique can detect a change in the statistical dependence structure in the
time series of aggregated anomaly scores and pinpoint where the topic emergence
is; see Fig. 2. The effectiveness of the proposed approach is demonstrated on
four data sets we have collected from Twitter. We show that our
mention-anomaly-based approaches can detect the emergence of a new topic at
least as fast as their text-anomaly-based counterparts. Furthermore, we show
that in three out of four data sets, the proposed mention-anomaly-based methods
can detect the emergence of topics much earlier than the text-anomaly-based
methods, which can be explained by the keyword ambiguity we mentioned above.
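The aggregation step can be illustrated with a short sketch. It assumes that
each post has already been given an anomaly score and a timestamp in seconds,
and it averages the scores of all users inside fixed-width time windows; the
function name `aggregate_scores` and the 60-second window are hypothetical
choices made here. The SDNML change-point analysis itself is not reproduced;
the resulting series is only the kind of input such an analysis would receive.

```python
from collections import defaultdict


def aggregate_scores(scored_posts, window_seconds=60.0):
    """Average per-post anomaly scores within fixed-width time windows.

    scored_posts: iterable of (timestamp_seconds, anomaly_score) pairs,
                  pooled over all users.
    Returns a list of (window_start_seconds, mean_score), sorted by time.
    Windows that contain no posts are omitted.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, score in scored_posts:
        window = int(timestamp // window_seconds)
        sums[window] += score
        counts[window] += 1
    return [
        (window * window_seconds, sums[window] / counts[window])
        for window in sorted(sums)
    ]


# Three posts from different users: two fall in the first minute, one in the second.
series = aggregate_scores([(0, 1.0), (30, 3.0), (70, 10.0)])
print(series)
```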
Results
Fig. 3a shows that the proposed approach, combined with SDNML-based change-point
analysis and dynamic threshold optimization (DTO), correctly identifies the
change point at 9:00, January 16, for the “Synthetic100” data set. We can
clearly see that the proposed link-based anomaly score (green curve in Fig. 3a)
is low in the period January 11-15 and high in the period January 16-20. The
SDNML-based change-point score (blue curve in Fig. 3a) rises sharply at the
change point and quickly goes back down to zero. DTO converts the rise in the
change-point score into a binary sequence of alarms (red curve in Fig. 3a). The
first detection time of SDNML+DTO was 9:00, January 16, ignoring the initial
instability around January 11. Fig. 3b further demonstrates that the proposed
link-based anomaly score can be combined with burst analysis. The two-state
burst model correctly identifies the low state of January 11-15 and the high
state of January 16-20. The first detection time of the burst approach was
9:01, January 16.
Figs. 4a and 4b show the same plots for the “Synthetic20” data set. Although
the change in the link-based anomaly score on January 16 was smaller because
fewer users reacted to the topic, the proposed SDNML+DTO successfully raised an
alarm at 10:30, January 16, ignoring the initial instability around January 11.
The burst-detection approach raised an alarm at 9:13, January 16, which was
earlier than the SDNML-based approach. The above results show that the proposed
approach can detect changes in the communication patterns of users even in a
realistic setting where only some of the users react to the emerging topic.
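The notion of a two-state (low/high) labeling of the anomaly-score series can be
sketched with a small dynamic program. This is a simplified stand-in, not the
burst model used in the experiments: it assumes known state levels `low_mean`
and `high_mean`, a squared-error fit cost, and a fixed penalty `switch_cost`
for every change of state, all of which are hypothetical choices made here.
The first index labeled 1 then plays the role of the first detection time.

```python
def two_state_segmentation(scores, low_mean, high_mean, switch_cost=2.0):
    """Label each score 0 (low) or 1 (high) with a Viterbi-style search.

    The chosen labeling minimizes the squared distance of each score to its
    state's level plus switch_cost for every change of state. The series is
    assumed to start in the low state.
    """
    means = (low_mean, high_mean)
    if not scores:
        return []
    cost = [(scores[0] - m) ** 2 for m in means]
    cost[1] += switch_cost  # starting in the high state counts as one switch
    back = []
    for x in scores[1:]:
        new_cost, pointers = [], []
        for state in (0, 1):
            stay = cost[state]
            switch = cost[1 - state] + switch_cost
            if stay <= switch:
                prev, best = state, stay
            else:
                prev, best = 1 - state, switch
            new_cost.append(best + (x - means[state]) ** 2)
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the final position.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


labels = two_state_segmentation([0.1, 0.2, 0.0, 3.1, 2.9, 3.0], 0.0, 3.0)
first_alarm = labels.index(1) if 1 in labels else None
print(labels, first_alarm)
```

A larger `switch_cost` makes the labeling ignore isolated spikes, which is one
way to trade detection delay against false alarms in this sketch.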
Discussion
Within the four data sets we have analyzed above, the proposed
link-anomaly-based methods compared favorably against the text-anomaly-based
methods on the “Youtube”, “NASA”, and “BBC” data sets. On the other hand, the
text-anomaly-based methods detected the topic earlier on the “Job hunting” data
set. The proposed link-anomaly-based methods performed even better than the
keyword-based methods on the “NASA” and “BBC” data sets.
The above results support our hypothesis that the emergence of a new topic is
reflected in the anomaly of the way people communicate with each other, and
also that such emergence shows up earlier and more reliably in the anomaly of
the mentioning behavior than in the anomaly of the textual contents. This is
probably because the textual words suffer from variations caused by rephrasing.
With respect to the keyword-based methods, the above observation is natural,
because for the “Job hunting” and “Youtube” data sets, the keywords seemed to
have been unambiguously defined from the beginning of the emergence of the
topics, whereas for the “NASA” and “BBC” data sets, the keywords are more
ambiguous. In particular, in the case of the “NASA” data set, people had been
mentioning the “arsenic”-eating organism earlier than NASA’s official release,
but only rarely (see Fig. 7f). Thus, the keyword-frequency-based methods could
not detect the keyword as an emerging topic, although the keyword “arsenic”
appeared earlier than the official release. For the “BBC” data set, the
proposed link-anomaly-based burst model detects two bursty areas (Fig. 8b).
Interestingly, the link-anomaly-based change-point analysis only finds the
first area (Fig. 8a), whereas the text-anomaly-based methods (Figs. 8c and 8d)
and the keyword-frequency-based methods (Figs. 8e and 8f) only find the second
area. This is probably because there was an initial stage in which people
reacted individually using different words, and a later stage in which the
keywords were more unified.