Data Mining with Big Data
ABSTRACT
Big Data concern large-volume,
complex, growing data sets with multiple, autonomous sources. With the fast
development of networking, data storage, and data collection capacity, Big
Data are now rapidly expanding in all science and engineering domains,
including physical, biological and biomedical sciences. This paper presents a
HACE theorem that characterizes the features of the Big Data revolution, and
proposes a Big Data processing model, from the data mining perspective. This
data-driven model involves demand-driven aggregation of information sources,
mining and analysis, user interest modeling, and security and privacy considerations.
We analyze the challenging issues in the data-driven model and also in the Big
Data revolution.
INTRODUCTION
Yan Mo won the 2012 Nobel Prize in Literature. This is probably the most controversial
Nobel prize of this category. A Google search for “Yan Mo Nobel Prize” returned
1,050,000 web pointers on the Internet (as of 3 January 2013). “For all
praises as well as criticisms,” said Mo recently, “I am grateful.” What types
of praise and criticism has Mo actually received over his 31-year writing career?
As comments keep coming on the Internet and in various news media, can we
summarize all types of opinions in different media in a real-time fashion,
including updated, cross-referenced discussions by critics? This type of
summarization program is an excellent example for Big Data processing, as the
information comes from multiple, heterogeneous, autonomous sources with complex
and evolving relationships, and keeps growing.
2. Analysis on Existing Networks:
Our
capability for data generation has never been so powerful and enormous
since the invention of information technology in the early 19th century. As
another example, on 4 October 2012, the first presidential debate between
President Barack Obama and Governor Mitt Romney triggered more than 10 million
tweets within 2 hours [46]. Among all these tweets, the specific moments that
generated the most discussions actually revealed the public interests, such as
the discussions about Medicare and vouchers. Such online discussions provide a
new means to sense the public interests and generate feedback in real time, and
are more appealing than generic media, such as radio or TV
broadcasting. Another example is Flickr, a public picture sharing site, which
received 1.8 million photos per day, on average, from February to March 2012 [35].
Assuming the size of each photo is 2 megabytes (MB), this requires 3.6
terabytes (TB) of storage every single day. Indeed, as the old saying goes, “a
picture is worth a thousand words”: the billions of pictures on Flickr are a
treasure trove for exploring human society, social events, public
affairs, disasters, and so on, but only if we have the power to harness the
enormous amount of data.
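The storage estimate above can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope illustration: the function and constant names are hypothetical, the 2 MB photo size is the text's own assumption, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
# Back-of-envelope check of the Flickr storage estimate.
# All names here are illustrative; they do not come from the original paper.

PHOTOS_PER_DAY = 1_800_000   # from the text: 1.8 million photos per day on average
ASSUMED_PHOTO_SIZE_MB = 2    # assumption stated in the text
MB_PER_TB = 1_000_000        # assumption: decimal units, 1 TB = 10^6 MB


def daily_storage_tb(photos_per_day: int, photo_size_mb: float) -> float:
    """Return the storage needed per day, in terabytes."""
    return photos_per_day * photo_size_mb / MB_PER_TB


daily_tb = daily_storage_tb(PHOTOS_PER_DAY, ASSUMED_PHOTO_SIZE_MB)
yearly_tb = daily_tb * 365   # extrapolation that assumes the upload rate stays constant

print(f"{daily_tb:.1f} TB per day, roughly {yearly_tb:,.0f} TB per year")
```

With binary units instead (1 TiB = 1,048,576 MiB), the same daily volume comes to about 3.43 TiB, so the 3.6 TB figure depends on the unit convention as well as on the assumed photo size.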
3. Idea on Proposed System:
HACE Theorem. Big Data starts with
large-volume, heterogeneous, autonomous sources with distributed and decentralized
control, and seeks to explore complex and evolving relationships among data.
These characteristics make it extremely challenging to discover useful knowledge from Big
Data. In a naïve sense, we can imagine that a number of blind men are trying
to size up a giant elephant (see Fig. 1), which will be the Big Data in this
context. The goal of each blind man is to draw a picture (or conclusion) of the
elephant according to the part of information he collects during the process.
Because each person’s view is limited to his local region, it is not surprising
that the blind men will each conclude independently that the elephant “feels”
like a rope, a hose, or a wall, depending on the region each of them is limited to. To make the problem
even more complicated, let us assume that 1) the elephant is growing rapidly
and its pose changes constantly, and 2) each blind man may have his own
(possibly unreliable and inaccurate) information sources that give him
biased knowledge about the elephant (e.g., one blind man may exchange his
feeling about the elephant with another blind man, where the exchanged knowledge
is inherently biased). Exploring the Big Data in this scenario is equivalent to
aggregating heterogeneous information from different sources (blind men) to
help draw the best possible picture to reveal the true pose of the elephant
in a real-time fashion. Indeed, this task is not as simple as asking each blind
man to describe his feelings about the elephant and then getting an expert to
draw one single picture with a combined view,
considering that each individual may speak a different language (heterogeneous