How Big Data Is Reshaping the Software Industry
You may be confused by the term big data. So, what is big data?
Big data refers to data sets that are very large in size. Normally, we measure data in megabytes (MB) and gigabytes (GB), but data sets on the scale of petabytes (10^15 bytes) are called big data.
The Four V’s of Big Data
Big data poses a number of challenges when designing algorithms or software systems that can deal with it. Big data is exciting and can change health care policy decisions and the way we do business. But to harness these benefits, we need to address several challenges first. First of all, I’ll walk you through the four V’s of big data, which capture the challenges that you will face when working with big data. The four V’s are volume, velocity, veracity and variety.
1) Volume
Volume in big data refers to the large amount of data that you have to deal with.
Nowadays, data is produced in very large quantities. Let’s take surveillance
cameras installed in a major city as an example. The number of these cameras
might be in the thousands and each of them is providing a constant video stream,
resulting in massive amounts of data even within one day.
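To get a feel for this volume, here is a rough back-of-the-envelope estimate in Python. The camera count and the per-camera bitrate are assumptions picked purely for illustration, not measured values, so treat the result as an order of magnitude only.

```python
# Back-of-the-envelope estimate of the video data a city's cameras produce per day.
# NUM_CAMERAS and BITRATE_MBPS are illustrative assumptions, not measurements.

NUM_CAMERAS = 5_000          # assumed: "thousands" of cameras
BITRATE_MBPS = 4.0           # assumed: megabits per second per camera
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_bytes(num_cameras: int, bitrate_mbps: float) -> float:
    """Return the total number of bytes all cameras produce in one day."""
    bits_per_day = num_cameras * bitrate_mbps * 1_000_000 * SECONDS_PER_DAY
    return bits_per_day / 8


def human_readable(num_bytes: float) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit we know.
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


if __name__ == "__main__":
    total = daily_volume_bytes(NUM_CAMERAS, BITRATE_MBPS)
    print(human_readable(total))
```

Under these assumed numbers the cameras produce a couple of hundred terabytes per day, so the petabyte range is reached within about a week.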
2) Velocity
Velocity refers to the speed at which the data arrives. Again, if we consider surveillance cameras, they provide data at a constant rate and often at high resolution, which results in lots of data arriving at high speed. The internet also delivers a vast amount of data at very high speed. A company’s firewall system has to monitor the high-speed traffic that tries to enter its network. In the context of cyber security, it’s crucial to deal with this high-velocity data and to make sure that it’s not part of a cyber-attack. Due to the high velocity of data, it might not be feasible to store or check all of the data. To cope with this issue, we look at sampling techniques that store a representative fraction of the data.
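One well-known sampling technique for streams is reservoir sampling, which keeps a fixed-size uniform random sample without knowing in advance how long the stream is. The post does not name a specific technique, so the sketch below is just one possible choice; the function name and the `seed` parameter are my own.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i (0-based) is kept with probability k / (i + 1),
            # replacing a uniformly chosen item already in the reservoir.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Simulate a stream of one million packet IDs and keep only 10 of them.
    sample = reservoir_sample(range(1_000_000), k=10, seed=42)
    print(sample)
```

The memory used depends only on `k`, not on the length of the stream, which is what makes this approach attractive when the data arrives too fast to store.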
3) Veracity
Veracity refers to the uncertainty of data. Often, data is not complete and can be noisy. So you cannot completely rely on all of the data that arrives, because there may be abnormalities within that data. Take location services on phones: even if every user provides their location, that location is usually not precise. The data may also be incomplete, as GPS coordinates cannot be obtained at some locations.
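As a small illustration of handling incomplete and imprecise location data, the sketch below separates usable readings from the rest. The `LocationReading` record layout and the 50-meter accuracy threshold are hypothetical choices made for this example, not something a particular phone API prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LocationReading:
    """One location report from a phone (hypothetical record layout)."""
    user_id: str
    lat: Optional[float]          # None when no GPS fix was available
    lon: Optional[float]
    accuracy_m: Optional[float]   # reported error radius in meters, if any


def is_usable(reading: LocationReading, max_error_m: float = 50.0) -> bool:
    """Return True if the reading is complete, in range, and precise enough."""
    if reading.lat is None or reading.lon is None or reading.accuracy_m is None:
        return False  # incomplete: missing coordinates or accuracy estimate
    if not (-90.0 <= reading.lat <= 90.0 and -180.0 <= reading.lon <= 180.0):
        return False  # outside the valid latitude/longitude ranges
    return 0.0 <= reading.accuracy_m <= max_error_m


def split_readings(
    readings: List[LocationReading], max_error_m: float = 50.0
) -> Tuple[List[LocationReading], List[LocationReading]]:
    """Split readings into (usable, rejected) lists."""
    usable = [r for r in readings if is_usable(r, max_error_m)]
    rejected = [r for r in readings if not is_usable(r, max_error_m)]
    return usable, rejected
```

In practice you would probably not throw the rejected readings away, but keeping them separate makes it explicit which part of the data you can trust.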
4) Variety
Lastly, the variety of big data refers to the different sources of data. Data can come in various forms such as images, video, audio, sensor data and so on. For a particular application, you might have to integrate data from various sources.
As you can see, the four V’s of big data are key to understanding the challenges in big data. Here are some more examples of different sources of big data, and how you can analyze them with respect to the four V’s.
Social Media
There are millions of people using Facebook and Twitter. All the data is
produced in an online fashion arriving in the form of a data stream. Users post
a variety of data online on Facebook, such as text, images and videos.
Similarly, Twitter has short text messages. The data is high volume and arrives with high velocity at the Facebook/Twitter servers. Users may be tagged by their location using GPS coordinates. These coordinates are usually imprecise, which raises the question of the data’s veracity.
Fraud Detection in Banking Transactions
Banking produces millions of transactions per day. These transactions have to be processed safely and reliably. Thinking about a bank’s transactions over a month results in a vast volume of data. Fraud detection refers to finding bogus transactions that have been triggered by criminals, for example by using a stolen credit card or even only its details. You see that for fraud detection
you would have to deal with large volumes of data, each transaction arriving
rapidly, and a decision having to be made as soon as a transaction arrives. There
are some indicators that can be used to identify fraud, for example a credit card used at an ATM in one country when all other transactions in the previous two days have been in another country. Finding fraud is hard, and the information used to stop a transaction is usually not 100% reliable. You might
even have observed this yourself when you tried to use your credit card in a
different country and the card was rejected although you were the legitimate
user of the card.
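The country indicator described above can be sketched as a simple rule. This is only a toy illustration, not how a real bank does it: the `Transaction` fields and the `is_suspicious` function are hypothetical names, and the rule flags a transaction when the card has activity in the previous two days and none of it was in the new transaction’s country.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Transaction:
    """A card transaction (hypothetical, heavily simplified record)."""
    card_id: str
    country: str
    timestamp: datetime


def is_suspicious(
    new_tx: Transaction,
    history: List[Transaction],
    window: timedelta = timedelta(days=2),
) -> bool:
    """Flag new_tx if the card's recent activity was all in other countries."""
    recent_countries = {
        tx.country
        for tx in history
        if tx.card_id == new_tx.card_id
        and timedelta(0) <= new_tx.timestamp - tx.timestamp <= window
    }
    if not recent_countries:
        return False  # no recent activity to compare against
    return new_tx.country not in recent_countries


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    history = [
        Transaction("card-1", "DE", now - timedelta(hours=30)),
        Transaction("card-1", "DE", now - timedelta(hours=5)),
    ]
    print(is_suspicious(Transaction("card-1", "FR", now), history))
```

A rule like this also flags a legitimate traveler, which is exactly the kind of false alarm described above, so a real system would combine many such indicators before rejecting a transaction.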
Skype
There are millions of people using Skype. Skype offers various types of communication: text, voice call and video call. It’s possible to send various types of files in text messages (PDF files, images, videos, etc.). At any given moment, millions of users from various locations around the world can be using Skype. In doing so, they produce a high volume of data that arrives at the Skype servers rapidly. This data has a high variety in terms of the different forms used for communication.
Online Stores
Online stores such as Amazon have millions of potential customers that buy a large
variety of items from their online servers. These customers produce a very
large number of transactions within a short time period. Mining these transactions to extract useful information (for example to optimize advertising) has to deal with the large number of users and the variety of items that they have bought. Making a recommendation to one particular user
takes into account what the user has bought so far. The knowledge gathered
about a customer is incomplete and a recommendation system has to rely on the
imprecise information that it can obtain from the transaction data of customers
and their behavior on the online store page.
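To make the idea of recommending from past purchases concrete, here is a minimal sketch based on counting which items are bought together. This is just one simple approach and not a description of how Amazon or any other store actually works; the function names and the item names in the example are made up.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set


def build_co_purchase_counts(baskets: Iterable[Iterable[str]]) -> Dict[str, Counter]:
    """Count, for every item, how often each other item appears in the same basket."""
    counts: Dict[str, Counter] = {}
    for basket in baskets:
        # Use a set so duplicates within one basket are counted once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def recommend(purchased: Set[str], counts: Dict[str, Counter], top_n: int = 3) -> List[str]:
    """Suggest items most often bought together with what the user already bought."""
    scores: Counter = Counter()
    for item in purchased:
        for other, n in counts.get(item, Counter()).items():
            if other not in purchased:
                scores[other] += n
    # Sort by score (descending), then by name so that ties are deterministic.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_n]


if __name__ == "__main__":
    baskets = [
        ["laptop", "mouse"],
        ["laptop", "mouse", "bag"],
        ["laptop", "bag"],
        ["phone", "case"],
    ]
    counts = build_co_purchase_counts(baskets)
    print(recommend({"mouse"}, counts))
```

Note that a user with no purchase history gets an empty list here, which reflects the point above: the system can only work with the incomplete information it has about each customer.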
Let’s take a break till my next post. Till then, happy coding 😊😊😊