This week we're talking about Big Data.
Big Data is a lot like good art. It is hard to put into words, but you know it when you see it. Big Data means different things to different people. A bank, for example, may view big data as the millions of transactions that it processes every day. A scientist may view big data as the information collected during an experiment. A university may view big data as the information it must collect and retain about students, courses, faculty and staff, and the associated metadata that goes with all of that.
Big data is a big topic, so we'll cover some of the basics in this edition of Talking Technology.
The Five Vs of Big Data

Let's start by examining the "Five Vs" of big data: volume, velocity, veracity, variety, and value. (This information is also available as a separate article, The Five Vs of Big Data.)
Volume is the amount of data in question: terabytes, petabytes, exabytes, and beyond. Sifting through such large amounts of data requires different algorithms and techniques than the data processing methods of previous years.
Velocity is the rate at which data arrives, which can vary from a slow trickle that accumulates a large volume over time to massive influxes over short periods of time. The more data you have coming in over short periods, the more you need to rely on techniques that sort and sift your data on the fly.
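To make "on the fly" concrete, here is a toy Python sketch (the record format and threshold are made up for illustration): each incoming record is examined as it arrives instead of being collected in full and processed in bulk.

    import json

    def high_value_trades(lines, threshold=1_000_000):
        # Process each record as it arrives; nothing is held in memory beyond
        # the current record. `lines` could be a file, socket, or message queue.
        for line in lines:
            trade = json.loads(line)
            if trade["amount"] >= threshold:   # hypothetical field name
                yield trade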
Veracity is the degree to which your data is valid for your purpose. If you are collecting weather data, are you using data from calibrated, approved weather stations? If you are collecting stock market data, does it come from a major exchange?
Variety refers to the many forms big data takes: social media streams, software, financial data, huge files, collections of small files, encrypted data, compressed data, and so on. Optimizing your processes based on the type of data coming in (or going out) is essential.
Finally, value: at the end of the day, all data is useless unless it provides value to your company. Months of Twitter comments take up vast amounts of storage space, but if your company has no Twitter presence, or does not use data from Twitter, what value are they to you? Decide what data is of value to you before you start collecting it. Searching for a needle in a haystack is hard enough without making the haystack needlessly larger.
Of course, big data needs somewhere to live. Storage considerations for big data run the gamut from completely onsite to completely offsite. Cloud (offsite) storage is appealing economically because you do not need to maintain a data center, although that savings is offset by the cost of transferring data and the possible delays in moving it around. A popular choice is a hybrid cloud solution, where some data resides onsite and the rest is offsite. The flexibility of a hybrid solution is part of the reason for its popularity: it allows you to keep seasonal or sensitive data onsite and the rest in the cloud, and it can also be a good choice when an existing data center has reached capacity.
Cloud storage systems come in a variety of formats. One of the most popular is S3 (Simple Storage Service), designed by Amazon. The S3 system is designed to achieve extremely high durability (99.999999999%) for data. The underlying structure of S3 is the data object, an immutable instance of data. Instead of changing data and saving it, a new data object is created. This makes it possible to establish data provenance and to roll back changes. S3 also allows metadata to be stored along with the data.
A number of companies use the Amazon S3 system, including Netflix, Dropbox, Tumblr and Pinterest. S3 can be used for both pure cloud storage as well as hybrid systems. There are S3 compliant systems offered by other companies including Ceph, Apache CloudStack, DELL EMC Elastic Cloud Storage (ECS), DigitalOcean Spaces, IBM Bluemix and Cloud Object Storage, Pure Storage FlashBlade, and NetApp StorageGRID.
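As a concrete illustration of the object-plus-metadata model, here is a minimal Python sketch using the boto3 library; the bucket and key names are hypothetical, and the same client can be pointed at one of the S3-compatible systems above by supplying its endpoint URL.

    import boto3

    # For an S3-compatible system, add endpoint_url="https://..." to this call.
    s3 = boto3.client("s3")

    # Each put_object call writes a new, immutable object; "changing" data
    # means writing a new object (or a new version, if versioning is enabled).
    with open("station-042.csv", "rb") as body:
        s3.put_object(
            Bucket="example-weather-archive",        # hypothetical bucket
            Key="stations/2024/06/station-042.csv",  # hypothetical key
            Body=body,
            Metadata={"station-id": "042", "calibrated": "true"},  # user-defined metadata
        )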
If you choose a hybrid solution, you should consider the time delay in moving data onsite. How long will it take to transfer your big data? Here's an example calculation from Amazon:

Number of Days = (Size of your data in bytes) / (Megabits per second * 125 * 1000 * Network Utilization * 60 seconds * 60 minutes * 24 hours)
Network utilization is how much bandwidth you can use for this process. If you have only one network line to the outside world, you can't use 100% of it or your business transactions will be squeezed out. If you have a dedicated line for data transfer, on the other hand, you can use 100%. Your network engineers (or your ISP's network engineers) can give you an idea of how much bandwidth you can use.
Let's say you have 200 terabytes of information to transfer over a 10 Gigabit/second network connection, of which you can use 50% for the transfer. The calculation would be:

200,000,000,000,000 / (10,000 * 125 * 1000 * 0.50 * 60 * 60 * 24) = 3.7 days
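If you want to experiment with these numbers, here is a small Python sketch of the formula above, using the figures from this example:

    def transfer_days(size_bytes, megabits_per_second, utilization):
        # Megabits/s * 125 * 1000 converts link speed to bytes per second;
        # multiplying by 60 * 60 * 24 gives usable bytes per day.
        bytes_per_day = megabits_per_second * 125 * 1000 * utilization * 60 * 60 * 24
        return size_bytes / bytes_per_day

    # 200 TB over a 10 Gigabit/s (10,000 Megabit/s) line at 50% utilization:
    print(round(transfer_days(200_000_000_000_000, 10_000, 0.50), 1))  # about 3.7 days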
If your transfer gets interrupted, it will take even longer, especially if you have to restart from scratch. If time is a critical factor, you should explore leasing a high-bandwidth connection for the time it would take to transfer your data. Leasing a 100 Gigabit line would make your transfer up to ten times faster than a 10 Gigabit line.
Storing and moving big data are just part of the bigger picture. How you process all that data is another issue. Do you create a data warehouse? A data lake? Do you use an extract-transform-load process to prep the data for analysis? What tools do you use for analysis? We'll look at those issues in Data Analytics.
If you have questions about moving your data and applications to the cloud, you can find an example of migrating to AWS in AWS Migration.
Data Age 2025 estimates that by 2025 global data will total 175 zettabytes. 1 zettabyte is 1 billion terabytes or 1 trillion gigabytes. That puts the BIG in big data.
Until next time, thanks for Talking Technology with me!