We are living in the midst of an unprecedented generation of data, with volumes galloping exponentially. An information overload indeed. Every day we create approximately 2.5 quintillion bytes of data globally. One quintillion is a billion billions (10 raised to the power of 18). Truly mind-blowing! By 2020, it is estimated that 1.7 MB of data will be created every second for every person on earth. Looked at from another perspective, as much as 90% of all data generated in human history has been created in the last two years alone. No wonder the colossal mass of data in currency is called “Big Data”.
Not only the size of the data, but also its variety, speed and multiplicity of sources are on a scale unimaginable just a decade ago. Big Data, the gargantuan mass of scattered and diverse data, is characterized by the three Vs – Volume, Variety and Velocity. Gartner, the leading global research and advisory firm on technology, defines Big Data as data that contains greater variety, arriving in increasing volumes and with ever-higher velocity. The ongoing data revolution rides on the advancements made in connectivity and processing capacity.
The spread of the internet and the advent of mobile communications have put data creation and applications into top gear. In 2014, there were 2.4 billion internet users. As of June 2019, there are over 4.4 billion. The number of people using the internet has nearly doubled in just five years! With the ever-growing popularity of social media and the variety of mobile apps for micro-messaging emerging in succession, the overall size of Big Data is bound to keep growing. A whole new discipline has emerged to make sense of the mountainous data bulge out there in the cyber world.
The purpose of data science is to find value, meaning and applicability in the massive amount of unstructured and dispersed data – both humanly created and generated by embedded intelligence such as sensors and devices. In the post-Industrial Revolution phase, which was marked by massive manufacturing activity, business was driven by the traditional factors of production – land, labour, capital and enterprise. However, in the wake of the Digital Technology revolution that followed the dawn of the internet, the pre-eminence of land and labour has yielded to the new critical elements of the business eco-system – Intellectual Property and Data.
Digital Age businesses are not geography-dependent. With the increasing deployment of Artificial Intelligence, Machine Learning, Robotics and the Internet of Things, dependence on physical labour is declining. There is a significant paradigm shift in the operating model of a digitalized business compared to the traditional ways of manufacturing or trading. Data and analytics are central to the new ways of doing business. Comprehending the enormity of the Big Data environment needs careful consideration of the three Vs mentioned above.
Volume is a massive and intriguing aspect of the data-driven business process. With big data, we need to process high volumes of low-density, unstructured data. This may include data from social media feeds, clickstreams on a webpage or mobile app, and inputs from connected devices and nodes. Velocity refers to the rate at which data is received and acted on. Some smart technologies feed data in real time, and this requires real-time analysis and follow-up – as in the case of a chemical plant that operates on smart technologies and generates real-time data, leading to real-time and often automated interventions.
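The real-time intervention idea above can be sketched in a few lines. This is a minimal illustration only: the sensor readings, the threshold and the coolant-valve action are all invented for the example, not taken from any actual plant system.

```python
# Minimal sketch of "velocity": a sensor stream is checked
# reading-by-reading, and an automated intervention fires as soon as
# a threshold is crossed. All values and names are illustrative.

def temperature_stream():
    """Simulated real-time feed from a plant sensor (degrees Celsius)."""
    for reading in [241.0, 243.5, 248.9, 252.3, 244.1]:
        yield reading

ALERT_THRESHOLD = 250.0  # hypothetical safe operating limit

def monitor(stream):
    """React to each reading as it arrives, not in a later batch."""
    interventions = []
    for reading in stream:
        if reading > ALERT_THRESHOLD:
            interventions.append(f"coolant valve opened at {reading}")
    return interventions

print(monitor(temperature_stream()))  # one reading exceeds the limit
```

The point of the sketch is that each reading is acted on as it arrives; a batch system that analyzed the day's readings overnight would be too late.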
Best practices in Quality Management, such as Zero-Defect programs, are immensely supported by IoT-enabled real-time data feeds. A wide variety of data is available across domains which does not fit neatly into a relational database. Unstructured and semi-structured data types require specialized tools for analysis and interpretation. The evolution and refinement of tools and applications for efficient and incisive data analysis are engaging the attention of researchers and developers across domains. Social media has come to occupy centre-stage in contemporary communication. Naturally, with its massive data resources, social media attracts critical attention from businesses and Governments.
Apart from its enormous size, the mainly user-generated social media content is noisy and unstructured, with underlying social relations as its base. Naturally, the data generated by social media involve privacy, sensitivity and emotion; they are generally meant for closed circles. While there is enormous value in social media feeds if they are curated and analyzed with sensitivity, social media analytics faces serious challenges. The tapping of social media, especially for commercial analytics, attracts ethical, legal and regulatory issues. Social media mining has fast evolved into a specialized field of data analytics.
It uses sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in the huge, unstructured data mass. Big Data is so voluminous that traditional data processing software just cannot manage it. Finding value in big data is a complex discovery process that requires asking the right questions, recognizing patterns, making meaningful assumptions, and predicting the behavior of variables. Because of its sheer size and complexity, we need to utilize a distributed computing environment to store, access and analyze Big Data. It is safer, more efficient and faster to do so than relying on centralized computing.
That is the principle on which programming frameworks like Hadoop work. Hadoop is an open-source, Java-based framework ideally suited for processing large data sets. It runs on distributed systems with thousands of nodes and can handle petabytes of information. A petabyte is a unit of information equal to one thousand million million bytes (10 raised to the power of 15). This gives an idea of the massive computing power involved in finding meaning in the chaos of Big Data. Advanced tools like Spark have emerged in the quest for ever-greater processing power. While Hadoop has high-volume capability, it has relatively high latency, rendering it inefficient where an instantaneous response is needed;
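The divide-and-combine principle that Hadoop applies across thousands of nodes can be illustrated in miniature. This toy word-count sketch runs on one machine, but the structure is the same: each chunk is processed independently (as a cluster node would do in the "map" step), and the partial results are then merged (the "reduce" step).

```python
# A toy illustration of the map/reduce idea behind frameworks like
# Hadoop: split the data, process each chunk independently, then
# merge the partial results into one answer.
from collections import Counter

def map_chunk(chunk):
    """'Map' step: count words in one chunk, as one node would."""
    return Counter(chunk.split())

def reduce_counts(partials):
    """'Reduce' step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

chunks = ["big data big value", "big insight from data"]
partials = [map_chunk(c) for c in chunks]  # could run on separate nodes
print(reduce_counts(partials)["big"])  # → 3
```

Because each chunk is processed without reference to the others, the work parallelizes naturally – which is exactly why the distributed approach scales to petabytes where a single machine cannot.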
Spark is a low-latency computing framework and can therefore be effective in interactive computing. (Higher latency implies a corresponding lag in response time; a lag of even nanoseconds might be material, as in the case of ultra-high-speed equity, commodities or currency trading.) Its strategic importance for commercial and Governmental purposes makes Big Data Analytics a critical competency in the digital domain. Without the power and range of analytics, all the unorganized data generated day in and day out across domains and devices is of little use. Predictive Analytics sharpens the strategic edge of big data applications.
It has wide applications in several domains, ranging from manufacturing, marketing and financial services to the public policy and service initiatives of Governments. Predictive analytics identifies the likelihood of future outcomes based on historical data. The utility of predictive analytics in business economics as well as policy planning cannot be overstated. A specific use case could be improving the quality of weather forecasts and, as a consequence, upgrading disaster-management preparedness. Similarly, in health and education, predictive analytics can provide a meaningful basis for planning targeted public policy interventions.
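At its simplest, "identifying the likelihood of future outcomes based on historical data" means fitting a model to past observations and extrapolating it. The sketch below fits a straight-line trend by ordinary least squares; the rainfall figures are invented purely for illustration, and real forecasting models are of course far more sophisticated.

```python
# Minimal sketch of predictive analytics: fit a trend to historical
# observations and extrapolate to the next period. Data are invented.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

years = [1, 2, 3, 4]             # historical periods
rainfall = [100, 110, 120, 130]  # hypothetical observations (mm)
a, b = fit_line(years, rainfall)
forecast = a + b * 5             # predict the next period
print(forecast)  # → 140.0
```

The same fit-then-extrapolate pattern underlies far richer models, whether the variable being predicted is rainfall, loan default risk or patient outcomes.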
In the product development domain, predictive analytics can throw light on potential market needs. Social media analytics would, for example, uncover likely trends in lifestyle-specific and demography-related market needs. The Banking and Financial Services sector relies on predictive analytics for both fraud and risk management. In these days of high credit vulnerability and strain on the financial health of lending institutions, the utility of predictive analytics has assumed utmost criticality. The insurance industry essentially deals with probabilities of risk, and hence predictive analytics is an invaluable input for devising market-focused plans and for aligning products and tariffs to anticipated changes in market conditions.
In addition to detecting claims fraud, the health insurance industry can, for example, identify patients most at risk of chronic diseases and plan customized products using predictive analytics outcomes. While big data holds a lot of promise, it is not without its challenges. Firstly, big data is too big. Data volumes are doubling in size almost every two years. Organizations struggle to keep pace with their data and find ways to store it effectively. Cloud storage capacities would need to be scaled up substantially and secured. Secondly, it is not enough to just store the data. We should be able to discover value in the data sets, and that depends on curation. Clean data – data that is relevant and organized – requires a lot of work. Data science is emerging as one of the critical areas of massive skill shortage anticipated over the next decade. Cyber Security and Data Analytics would attract a lot of talent over the next five to ten years.
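Why does clean data "require a lot of work"? Raw records typically arrive with inconsistent casing, stray whitespace and duplicates, all of which must be normalized away before any analysis is trustworthy. A minimal sketch, with entirely made-up records and field names:

```python
# Curation sketch: normalize noisy records (case, whitespace) and
# drop duplicates before analysis. Records are invented for the example.

raw_records = [
    {"customer": " Asha ", "city": "Trivandrum"},
    {"customer": "asha",   "city": "trivandrum"},  # same person, messier
    {"customer": "Binu",   "city": "Kochi"},
]

def clean(records):
    """Lower-case, strip whitespace, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        key = tuple(sorted((k, v.strip().lower()) for k, v in record.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(dict(key))
    return cleaned

print(len(clean(raw_records)))  # → 2
```

Even this toy version hints at the real difficulty: deciding what counts as "the same" record is a judgment call, and at big-data scale those judgment calls multiply into a substantial curation effort.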
Finally, big data technology is changing rapidly. Hadoop and Spark, to mention two of the raging favorites, are only stepping-stones to the emergence of even more powerful tools. The future of Big Data is likely to see virtualized solutions and huge demand for cloud storage capacity. Big Data is not just redefining the future of computing; it is pervading every aspect of it and discovering newer and more powerful applications. For granular insights into business, customer behavior, economic trends or Governance policies, to name just a few of the strategic information needs, Big Data is crucial. As an idea, Big Data is truly transformational in its impact on shaping the future of businesses and economies.
*Ravi Kumar Pillai is CEO and Principal Consultant, Cherrypick India Consulting and Business Solutions Private Limited, Trivandrum and can be contacted at cherrypickindia@gmail.com.