“Big Data”: Some Infrastructural Issues

By Administrator 123erty

Published On: May 16, 2012 | Categories: Faculty Blog

A precise, formal definition of “big data” has perhaps not yet evolved. Nevertheless, CIOs and the managers of large datacenters have faced the problem of managing big data for some time now. Earlier, high-volume transaction systems (airlines, banks, financial services, etc.) demanded mainly processing speed, and the resulting updates did not expand the database or its storage at an unmanageable rate. The term big data usually refers to the situation where an enterprise application generates bulk data of many varieties that contains critical business information and is too large to be processed by conventional relational databases. Today businesses are generating more data than ever before through Internet transactions, social networking activity, mobile devices, NFC, automated sensors, scientific instrumentation, and so on. The challenge, therefore, lies not only in storage but also in structuring the data so that the critical business information it contains can be analyzed.

How big is Big Data?

The size of big data can be appreciated from the National Institutes of Health’s “International 1000 Genomes Project” data. At 200 terabytes (the equivalent of 16 million file cabinets filled with text, or more than 30,000 standard DVDs), the current 1000 Genomes Project data set is a prime example of big data, where data sets have become so massive and so varied that few researchers have the computing power to make the best use of them. Fortunately, this data set is publicly available on Amazon Web Services (AWS).
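The DVD comparison can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a single-layer DVD capacity of 4.7 GB; neither figure is stated in the source, and the function name is only illustrative.

```python
DATASET_BYTES = 200 * 10**12   # 200 TB in decimal units (assumption)
DVD_BYTES = 4_700_000_000      # 4.7 GB single-layer DVD (assumption)


def dvds_needed(dataset_bytes: int, dvd_bytes: int = DVD_BYTES) -> int:
    """Return the number of whole DVDs needed to hold the data set."""
    # Ceiling division with integers avoids floating-point rounding.
    return -(-dataset_bytes // dvd_bytes)


print(dvds_needed(DATASET_BYTES))  # prints 42554
```

Under these assumptions the result is a little over 42,000 discs, which is consistent with the "more than 30,000" figure quoted above.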

Big data has also prompted the US Federal Government to launch the “Big Data Research and Development Initiative”, a programme “focused on improving the U.S. Federal government’s ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help solve some the Nation’s most pressing challenges.”[1]

In this new trend, big data has upset conventional infrastructure provisioning, forcing new development in storage, networking and software systems to handle these specific challenges. Just as software requirements drive hardware to a new cutting edge, the need for big data analytics is shaping the emergence of new storage infrastructures. This is an obvious opportunity for storage, WAN optimization and other IT infrastructure companies. Current storage system designs are unable to handle the growing needs of big data, which is mostly unstructured and only partly in conventional structured form. What follows is a listing of the infrastructural requirements that need to be met to face the challenges of big data.

As a flood of unprecedented data sets sweeps across the variety of processes in the business world, the IT industry is required to keep its infrastructure tuned to face the challenges of big data and to achieve competitive advantage through analysis. The question naturally arises: which factors matter most in a big data infrastructure? There could be many, but the main ones worth discussing are classified in this blog under technical and commercial aspects, as shown below:

Technical aspects concerning Big Data

(i) “Big data” implies the capacity to handle petabytes of data. For that, the infrastructure needs to scale up with ease. As the data grows, storage has to be added incrementally, in modules that are transparent to the user. The new technology trend is the scale-out clustered architecture, which features nodes of storage capacity with embedded processing power. With robust connectivity these can grow seamlessly, avoiding silos of storage. Big data also demands the handling of a large number of files. Object-based storage architecture promises to scale to very large file counts without suffering the overhead problems that traditional file systems encounter. Object-based storage systems can also scale geographically, enabling large infrastructures to be spread across multiple locations.
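One common way to let a cluster of storage nodes grow without reshuffling everything is consistent hashing. The sketch below is only an illustration of that idea, not a description of any particular product: the class name `HashRing`, the node names and the number of virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Map object keys to storage nodes with consistent hashing (illustrative)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per node, to smooth the spread
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        """Place the node's virtual points on the ring."""
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point after the key's hash."""
        if not self._points:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])   # hypothetical node names
keys = [f"object-{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")                           # scale out by one node
after = {k: ring.node_for(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of objects were reassigned")
```

Only the objects that now belong to the new node are reassigned; everything else stays where it was, which is what lets such a cluster grow in modules without disturbing its users.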

(ii) Big data infrastructure may also have a real-time component, especially in Web transactions. In a real-time environment, latency in analytics degrades performance and produces “stale data”. For example, tailoring Web advertising to each user’s browsing history requires real-time analytics. Storage systems must be able to grow in proportion to those requirements while maintaining performance. Here, too, scale-out architecture and object-based storage enable the cluster of storage nodes to increase in processing power and connectivity, improving throughput.
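To make the notion of “stale data” concrete, the sketch below treats a record as stale once its age exceeds a freshness budget. The `BrowsingEvent` type, the field names and the two-second budget are hypothetical, chosen only for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BrowsingEvent:
    """A hypothetical record of one user action."""
    description: str
    created_at: float = field(default_factory=time.time)  # seconds since epoch


def is_stale(event: BrowsingEvent, max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """Return True when the event is older than the freshness budget."""
    current = time.time() if now is None else now
    return (current - event.created_at) > max_age_seconds


event = BrowsingEvent("viewed product page", created_at=100.0)
print(is_stale(event, max_age_seconds=2.0, now=101.0))  # prints False
print(is_stale(event, max_age_seconds=2.0, now=105.0))  # prints True
```

If storage latency alone consumes the budget, every record reaches the analytics layer already stale, which is why performance has to be maintained as capacity grows.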

(iii) As businesses become aware of the promise of big data analytics, the need to compare differing data sets becomes obvious. This brings in more users (multiple users on multiple hosts) to access and share this mine of big data. To extract business value from big data, analytics looks for more ways to cross-reference different data objects from different platforms. Access and security issues thus come to the forefront. Financial data, medical information and government intelligence carry their own security standards and requirements. While these may not differ from what current IT managers must already accommodate, big data analytics may need to cross-reference data in combinations that have not been considered for security before. This may raise some new security issues to be resolved.
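One possible policy for cross-referenced data, shown here purely as an assumption and not as an established standard, is to treat a combined result as being as sensitive as its most sensitive input. The sensitivity levels and function names below are hypothetical.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def combined_sensitivity(*levels: Sensitivity) -> Sensitivity:
    """A cross-referenced result inherits the strictest level of its inputs."""
    return max(levels)


def may_cross_reference(clearance: Sensitivity, *levels: Sensitivity) -> bool:
    """Allow the query only if the user is cleared for the combined result."""
    return clearance >= combined_sensitivity(*levels)


# Joining public browsing data with confidential medical data:
result = combined_sensitivity(Sensitivity.PUBLIC, Sensitivity.CONFIDENTIAL)
print(result.name)  # prints CONFIDENTIAL
print(may_cross_reference(Sensitivity.INTERNAL,
                          Sensitivity.PUBLIC, Sensitivity.CONFIDENTIAL))  # prints False
```

A real deployment would follow the standards that apply to each data type; the point of the sketch is only that access rules have to be evaluated on the combination, not on each data set in isolation.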

(iv) Big data environments will also need to provide high IOPS performance. Server virtualization, for example, calls for high IOPS, just as it does in conventional IT environments. To meet these challenges, solid-state storage devices can be implemented in many different formats, from a simple server-based cache to all-flash-based scalable storage systems.
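The “simple server-based cache” end of that range can be pictured as a small, fast store sitting in front of slower storage. The sketch below uses a least-recently-used (LRU) eviction policy, which is an assumption for illustration; the `ReadCache` class and the `slow_read` backend are hypothetical.

```python
from collections import OrderedDict


class ReadCache:
    """A tiny LRU read cache in front of a slower backend (illustrative)."""

    def __init__(self, backend_read, capacity=1024):
        self._read = backend_read      # function: block_id -> data
        self._capacity = capacity      # maximum number of cached blocks
        self._blocks = OrderedDict()   # block_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)   # mark as most recently used
            self.hits += 1
            return self._blocks[block_id]
        self.misses += 1
        data = self._read(block_id)              # slow path: go to the backend
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)     # evict least recently used
        return data


backend_calls = []


def slow_read(block_id):
    """Stand-in for a disk read; records each call it receives."""
    backend_calls.append(block_id)
    return f"data-{block_id}"


cache = ReadCache(slow_read, capacity=2)
for block in [1, 2, 1, 3, 2]:
    cache.read(block)

print(cache.hits, cache.misses)  # prints 1 4
print(backend_calls)             # prints [1, 2, 3, 2]
```

Every hit is an I/O operation the slower backend never sees, which is how even a modest flash cache can raise the effective IOPS of a system.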

Commercial aspects in Big Data

(i) “Big data” can also lead to big cost. Steps towards cost containment are therefore being sought alongside performance criteria. Data deduplication has already become a standard operational feature, economizing on space and time. Depending on the data types involved, it could bring some cost relief to the big data storage environment. The ability to reduce capacity consumption on the back end, even by a few percentage points, can provide a significant return on investment as data sets grow. Thin provisioning, snapshots and clones may also provide some efficiencies, depending on the data types involved.
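Deduplication can be sketched as storing each distinct chunk of data once and keeping only references to it. The example below uses fixed-size chunks identified by their SHA-256 digest; the `DedupStore` class, the chunk size and the file names are assumptions for illustration, and real products typically use more elaborate chunking.

```python
import hashlib


class DedupStore:
    """Store each distinct fixed-size chunk once (illustrative sketch)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}          # digest -> chunk bytes, stored once
        self.files = {}           # file name -> ordered list of digests
        self.logical_bytes = 0    # bytes written by users

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # keep only the first copy
            digests.append(digest)
        self.files[name] = digests
        self.logical_bytes += len(data)

    def get(self, name: str) -> bytes:
        """Reassemble a file from its chunk references."""
        return b"".join(self.chunks[d] for d in self.files[name])

    @property
    def physical_bytes(self) -> int:
        """Bytes actually consumed on the back end."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore(chunk_size=4)
store.put("report-v1", b"AAAABBBBCCCC")
store.put("report-v2", b"AAAABBBBDDDD")   # shares two chunks with v1

print(store.logical_bytes, store.physical_bytes)  # prints 24 16
```

How much is saved depends entirely on how repetitive the data is, which is why the benefit varies with the data types involved.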

(ii) Many big data storage systems will also include an efficient archive system, particularly for organizations (such as banks and government bodies) that need long-term retention and often refer to historical data. Some applications involve regulatory compliance that dictates data be saved for years or decades. Medical information is often saved for the life of the patient. Financial information is typically saved for seven years. But big data users are also saving data longer because it is part of a historical record or is used for time-based analysis. This requirement for longevity means storage manufacturers need to include ongoing integrity checks and other long-term reliability features, all of which involve cost. Tape is still the most economical storage medium, and archiving systems have started incorporating multi-terabyte cartridges suitable for the big data environment.
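An ongoing integrity check can be as simple as recording a checksum when data is archived and comparing it again later. The sketch below assumes SHA-256 and an in-memory catalog; the function names and the sample record are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def archive(catalog: dict, name: str, data: bytes) -> None:
    """Record the checksum of an item at the time it is archived."""
    catalog[name] = checksum(data)


def verify(catalog: dict, name: str, data: bytes) -> bool:
    """Return True if the data still matches its recorded checksum."""
    return catalog.get(name) == checksum(data)


catalog = {}
original = b"account statement, May 2012"
archive(catalog, "statement-001", original)

print(verify(catalog, "statement-001", original))                        # prints True
print(verify(catalog, "statement-001", b"account statement, May 2013"))  # prints False
```

Running such a check periodically over years of retained data is part of the cost that long-term retention brings.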

(iii) What may have the biggest impact on cost containment is the use of commodity hardware. Many of the first and largest big data users have developed their own “white-box” systems that leverage a commodity-oriented, cost-saving strategy. But more storage products are now coming out in the form of software that can be installed on existing systems or on common, off-the-shelf hardware. As a business requirement, big data analytics will eventually reach smaller organizations as well. This may provide new opportunities for storage vendors who keep the cost aspects in view.[2]

Summary:

IT has evolved from the days of EDP, when data was created manually and fed to the system for processing. The challenges then were speed and accuracy. With the advent of big data, the main challenges lie in the domain of storage, so that analytics does not fail to trace relevant data sets spread across geographical locations for cross-referencing. To quote Dr. John Holdren, Director of the White House Office of Science and Technology Policy, “…the initiative (Big Data Research and Development Initiative) we are launching today promises to transform our ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security.”[1]

Contributed by: Prasenjit Sen

References:

[1] http://www.forbes.com/sites/reuvencohen/2012/05/13/the-white-house-is-spending-big-money-on-big-data/

[2] Blog of Eric Slack: http://searchstorage.techtarget.com/feature/What-to-consider-when-choosing-a-big-data-infrastructure?asrc=EM_NLT_17285926&track=NL-57&ad=871129&