“Data is the new science. Big Data holds the answers.” – Pat Gelsinger, CEO of VMware
Big Data consists of a vast volume of both structured and unstructured data. It can be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data, consisting of billions to trillions of records about millions of people across the world, drawn from sources such as the web, sales, customer contact centers, social media, and mobile devices.
Big Data’s significance doesn’t revolve around how much data you have, but what you do with it…
Data from any source, when captured, formatted, manipulated, stored, and then analyzed, can help a company increase revenue, reduce cost and time, develop new products, gain or retain customers, make intelligent decisions, and improve operations.
With humongous volumes of data being collected, understanding the particulars of Big Data has become more than a necessity. This list of Big Data terms and definitions, from the basics to the advanced, should serve as a guide for beginners.
Are you interested to know more? Then peek in!
Algorithm
A mathematical or analytical formula or statistical process implemented in software to analyze and process input data and produce results. In Big Data, an algorithm is an effective, step-by-step specification of how to solve a complex problem and, at the same time, how to perform data analysis. For example, it can be a set of rules given to an AI, a neural network, or another machine to perform classification, clustering, recommendation, or regression.
Data Analysis and Analytics
In simple terms, the discovery of insights in data. Analytics helps us sort out meaningful data from data mass. Data analytics is a broader term in which data analysis forms a subcomponent. Data analysis refers to the process of compiling and analyzing data to support decision making, whereas data analytics includes the tools and techniques to do so.
Let’s consider a layman’s example.
When you receive your year-end credit card statement listing transactions for the entire year, you calculate what percentage you spent on food, clothing, entertainment, and so on. That is ‘analytics’.
If it is 25% on food, 35% on clothing, 20% on entertainment and the rest on miscellaneous items then that is ‘Descriptive Analytics’.
‘Predictive Analytics’ is when you analyze your credit card history for the past 5 years. If the split has been fairly consistent, you can forecast with high probability that next year’s spending will be similar. Similarly, data scientists use advanced techniques like machine learning or statistical modeling to forecast the weather, crop yields, economic changes, and so on.
You may want to find out which category to target next year to make a huge impact on your overall spending, i.e. reduce food or clothing or entertainment. ‘Prescriptive Analytics’ helps you make data-driven decisions by looking at the impacts of various actions.
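The descriptive-analytics step above can be sketched in a few lines of Python. The transactions and amounts below are made up purely for illustration:

```python
# A minimal sketch of descriptive analytics: total spend per category
# as a percentage of overall spend (hypothetical transactions).
from collections import defaultdict

transactions = [
    ("food", 120.0), ("clothing", 80.0), ("food", 30.0),
    ("entertainment", 60.0), ("clothing", 95.0), ("misc", 15.0),
]

totals = defaultdict(float)
for category, amount in transactions:
    totals[category] += amount

grand_total = sum(totals.values())
shares = {cat: round(100 * amt / grand_total, 1) for cat, amt in totals.items()}
print(shares)
```

Predictive analytics would fit a model to several years of such splits; prescriptive analytics would simulate the effect of cutting each category.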
Keen to know more terms? Read on…
Behavioral Analytics
Ever wondered how Google shows you ads for products or services that may just be on your mind?
It is about sensing our web surfing patterns, social media interactions, and e-commerce actions, and connecting these seemingly unrelated data points to purchasing propensity. Behavioral Analytics focuses on why, how, and what consumers and applications do, instead of just the who and when. It enables marketers to make the right offers to the right customers at the right time.
Call Detail Record (CDR) Analysis
This includes metadata about phone calls – when, where, and how calls are made, who called whom, the type of call (inbound, outbound, or toll-free), the call duration, and how much the call cost (based on a per-minute rate). It gives businesses the details required for billing and reporting purposes.
This is a systematic process for obtaining important and relevant information about data and metadata.
This refers to analyzing users’ online activity, such as the items a user clicks on a web page while surfing the internet.
Big brother knows what you are clicking and keeps following you even when you switch websites!
Cluster Analysis
This is an exploratory data mining technique for identifying data objects that are similar to each other and grouping them into clusters in order to understand the similarities and differences within the data. It is also known as segmentation analysis or taxonomy analysis.
It is a common strategy to analyze statistical data in various fields – image analysis, pattern recognition, machine learning, computer graphics, data compression and so on. More specifically, it tries to identify homogenous groups of cases, i.e., observations, participants, respondents. The different cluster analysis methods can handle binary, nominal, ordinal, interval or ratio data.
A step-by-step procedure of comparing multiple processes, data sets or other objects using statistical techniques such as pattern analysis, filtering and decision-tree analytics. For example, it can be used in healthcare to compare large volumes of medical records, documents, and images for more accurate medical diagnoses.
If you have seen a spiderweb-like chart with interrelated connections between people, products, and systems within a network or multiple networks then that is called connection analytics.
Exploratory Data Analysis (EDA)
EDA is the first step in data analysis, performed before any formal statistical techniques are applied. It helps the analyst get a “feel” for the data set and determine what its most important elements are.
Graph analytics, also known as network analysis, is the application of graph theory to categorize, understand, and determine the strength and direction of relationships between different data points.
It powers a new era of analytics workloads such as social network influencer analysis – identifying potential targets for marketing campaigns who can trigger chain reactions among social network communities to buy products and services. It is also used in detecting financial crime and cyberattacks, conducting research in the life sciences, and managing supply distribution chains and logistics.
This is the process of analyzing, interpreting, and gaining insights from the geographic component or location of business data.
Object-based Image Analysis
This is the analysis of digital images based on groups of related pixels, known as image objects or simply objects, rather than on individual pixels alone.
It refers to the process of optimization during the design cycle of products done by algorithms. Product team can virtually design many different variations of a product and test that product against pre-set variables.
Parallel Data Analysis
It is the process of breaking an analytical problem into small portions and then running analysis algorithms on each of the portions simultaneously. This type of data analysis can be run either on the different systems or on the same system.
It is a process used in databases that make use of SQL to determine how to further optimize queries for performance.
It is a process to find the optimized routing with the use of various variables for transport to improve efficiency and reduce costs.
It is the process of analyzing spatial data such as geographic data or topological data to identify and understand patterns and regularities within data distributed in geographic space. This analysis helps to identify and understand everything about a particular area or a position.
Sentiment Analysis
It is a process that involves capturing and tracking the opinions, emotions, or feelings expressed by consumers in various types of interactions or documents, including social media, calls to customer service representatives, surveys, and the like. Algorithms are used to determine or assess the sentiments or attitudes expressed toward a company, product, service, person, or event.
Two activities under this analysis are – Text analytics and natural language processing.
Text analytics is used to derive insight or meaning from text data by applying linguistic, machine learning, and statistical techniques to text-based sources.
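The sentiment-scoring idea described above can be sketched as a toy lexicon-based scorer. The word list and weights here are hypothetical; production systems use far richer models:

```python
# A toy lexicon-based sentiment scorer: sum the weights of known
# opinion words in the text (hypothetical lexicon).
LEXICON = {"great": 1, "love": 1, "slow": -1, "terrible": -2}

def sentiment(text):
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(sentiment("Great service but terrible wait"))   # -1
```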
Natural Language Processing
Software algorithms that enable computers to accurately understand everyday human speech, allowing human beings to interact more naturally and efficiently with them.
Time Series Analysis
It is the analysis of well-defined data obtained through repeated measurements at successive, evenly spaced points in time.
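A common first step on such evenly spaced measurements is smoothing. This is a minimal sketch of a 3-point moving average; the sample values are hypothetical:

```python
# Smooth an evenly spaced time series with a simple moving average.
def moving_average(series, window):
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

monthly_sales = [10, 12, 11, 15, 14, 18]
print(moving_average(monthly_sales, 3))
```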
Topological Data Analysis
This refers to focusing on the shape of complex data and identifying clusters and any statistical significance that is present within that data.
It is a process or procedure to track the risks of an action, project or decision. The risk analysis is done by applying different statistical techniques on the datasets.
Unstructured Data Analysis
Unstructured data analysis refers to the process of analyzing data objects that are stored over time within an organizational data repository, don’t follow a predefined data model or architecture, and/or are unorganized.
Computer System Resources and Programming
Apache Software Foundation (ASF) provides many Big Data open source projects.
- Apache Kafka: Kafka, named after the famous Czech writer, is used for building real-time data pipelines and streaming apps. It enables storing, managing, and processing streams of data in a fast and fault-tolerant way. Kafka gained popularity in social network environments dealing with streams of data; it was originally developed at LinkedIn and is now an Apache project.
- Apache Mahout: Mahout is an open-source data mining library of pre-made algorithms for machine learning and data mining – regression, classification, clustering, recommendation, and statistical modeling – many of them implemented using the MapReduce model.
- Apache Oozie: Oozie provides a workflow system to schedule and run jobs in a predefined manner and with defined dependencies for Big Data jobs written in languages like Pig, MapReduce, and Hive.
- Apache Drill, Apache Impala, Apache Spark SQL: If you know SQL and have an understanding of how data is stored in big data format (i.e. HBase or HDFS), you can easily use these frameworks. They provide quick SQL like interactions with Apache Hadoop data.
- Apache Hive: This is a data warehouse software project built on top of Apache Hadoop that helps in querying, reading, writing, analyzing, and managing large datasets residing in distributed storage using SQL.
- Hadoop User Experience (Hue): Hue is an open-source interface that makes it easier to use Apache Hadoop. It is a web-based application with a file browser for HDFS, a job designer for MapReduce, an Oozie application for building coordinators and workflows, a shell, Impala and Hive UIs, and Hadoop APIs.
- Apache Pig: Pig is a high-level platform for creating query execution routines on large, distributed data sets that run on Apache Hadoop. The scripting language used is called Pig Latin. Pig executes its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
- Apache Sqoop: This is a command-line interface application for transferring data between relational databases and Hadoop developed by Apache.
- Apache Storm: This is a free and open-source real-time distributed stream processing computation framework, written predominantly in the Clojure programming language. It helps process unbounded streams of unstructured data in real time, complementing the batch processing that Hadoop provides.
- Apache Flume: It is a simple and flexibly architected open-source tool for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.
- Apache Flink: It is an open-source streaming data processing framework that is written in Java and Scala and is used as a distributed streaming dataflow engine.
- Apache NiFi: It is an open-source Java server that enables the automation of data flows between systems in an extensible, pluggable, open manner.
- Apache HBase: This is the Hadoop database – an open-source, scalable, versioned, distributed big data store. Some features of HBase are:
- Modular and linear scalability
- Easy-to-use Java APIs
- Configurable and automatic sharding of tables
- Extensible JRuby-based (JIRB) shell
Artificial Intelligence (AI)
Aren’t all these trending technologies connected to each other?
AI refers to the development of intelligent machines and software that can perceive their environment and perform tasks that otherwise require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. It is intelligence demonstrated by machines.
Cascading
This is a layer of higher-level abstraction for Hadoop. It allows developers to easily create complex jobs using several different languages that run on the JVM, such as Ruby, Scala, and more.
Clojure
This is a functional programming language based on LISP that uses the Java Virtual Machine (JVM) and is used in parallel data processing.
Are you looking for a free and open-source, distributed, wide column store, NoSQL database management system?
Apache Cassandra is a popular DBMS with exactly these features. It is designed to handle large amounts of data across many commodity servers while providing high availability with no single point of failure.
Apache Chukwa
A Hadoop sub-project devoted to large-scale log collection and analysis. It is built on HDFS (the Hadoop Distributed File System) and MapReduce and inherits Hadoop’s robustness and scalability. It also comes with a powerful and flexible toolkit for displaying, monitoring, and analyzing results, to make the most of the collected data.
Cloud Computing essentially means “software and/or data hosted and running on remote servers and accessible from anywhere on the internet”. It is offered as IaaS, PaaS, and SaaS. Flexible scaling, rapid elasticity, resource pooling, and on-demand self-service are some of its defining characteristics. Running applications and storing data in the cloud often provides organizations with significant cost savings and operational simplicity.
It’s a method of computing using a ‘cluster’ of pooled resources of multiple servers.
This refers to performing computing functions with resources from several distributed systems.
ETL stands for extract, transform, and load: the process of ‘extracting’ raw data, ‘transforming’ it by cleaning and enriching it to make it ‘fit for use’, and ‘loading’ it into the appropriate repository for the system’s use.
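The three ETL stages can be sketched in miniature. The source rows below and the dict standing in for the target repository are hypothetical placeholders for real systems:

```python
# A minimal ETL sketch: extract from a stand-in source, transform
# (clean and normalise), load into a stand-in warehouse.
raw_rows = [
    {"name": " Alice ", "spend": "120"},
    {"name": "BOB", "spend": "80"},
    {"name": "", "spend": "15"},          # dirty record, dropped
]

def extract():
    return raw_rows

def transform(rows):
    clean = []
    for row in rows:
        name = row["name"].strip().title()
        if not name:
            continue                      # cleansing: drop invalid rows
        clean.append({"name": name, "spend": int(row["spend"])})
    return clean

warehouse = {}
def load(rows):
    for row in rows:
        warehouse[row["name"]] = row["spend"]

load(transform(extract()))
print(warehouse)
```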
Big data and Hadoop have become too mainstream!
Hadoop forms a part of the Apache project sponsored by the Apache Software Foundation. It is an open-source, Java-based programming framework built around the Hadoop Distributed File System (HDFS). It allows storage, retrieval, and analysis of very large data sets in a distributed computing environment. You can run it on clusters of commodity hardware nodes, handle thousands of terabytes of data, and lower the risk of catastrophic system failure and unexpected data loss.
Distributed File System
As big data is too large to store on a single system, the Distributed File System is a data storage system meant to store large volumes of data across multiple storage devices. Usage of DFS helps in decreasing the cost and complexity of storing large amounts of data.
Apache Drill
It is an open-source, distributed, low-latency SQL query engine for Hadoop that does not require a fixed schema and is designed for semi-structured and nested data. Drill is similar in some respects to Google’s Dremel and is managed by Apache.
In-Memory Computing
A technique of moving the working datasets entirely into a cluster’s collective memory, which avoids writing intermediate calculations to disk. Apache Spark is an in-memory computing system, and it has a huge speed advantage over I/O-bound systems like Hadoop’s MapReduce.
IoT(Internet of Things)
IoT is the system of interconnected computing devices generating huge amounts of data. IoT possesses the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.
Machine Learning
A highly effective way of performing data analysis. Computer systems learn from the data fed to them using predictive and statistical algorithms, adjust and correct their behavior and insights, and keep improving. Machine learning automates analytical model building and relies on the system’s ability to adapt.
MapReduce
This programming model breaks a big dataset into pieces so they can be distributed across computers in different locations (essentially the ‘Map’ part). The model then collects the results and ‘reduces’ them into one report. MapReduce’s data processing model goes hand in hand with Hadoop’s distributed file system.
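The classic illustration of this model is word count. This plain-Python sketch shows the three conceptual phases in miniature, without any cluster:

```python
# A word-count sketch of the MapReduce model: map emits (word, 1)
# pairs, shuffle groups them by key, reduce sums each group.
from collections import defaultdict

documents = ["big data", "big data tools", "data"]

# Map: emit one (key, value) pair per word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group into a single result
counts = {word: sum(values) for word, values in groups.items()}
print(counts)
```

In a real cluster, the map and reduce steps run in parallel on different nodes, with the framework handling the shuffle.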
MongoDB is a cross-platform, open-source database that uses a document-oriented data model rather than a traditional table-based structure. It is designed to make integrating structured and unstructured data easier and faster in certain types of applications. It saves data structures as JSON-like documents with a flexible schema, in a binary format known as BSON.
NoSQL actually stands for Not ONLY SQL!
It is a database management system designed to handle large volumes of data that have no fixed structure or ‘schema’. NoSQL is often well suited to big data systems because of the flexibility and distributed-first architecture that large unstructured databases need. A NoSQL database is not built on tables, and it doesn’t use SQL to manipulate data.
NewSQL is a class of modern relational database management systems that provide the same scalable performance as NoSQL systems for OLTP read/write workloads. It is a well-defined database system that is easy to learn.
Python is an easy, flexible, general-purpose, object-oriented programming language that emphasizes code readability in order to allow programmers to use fewer lines of code to express their concepts. Python is more productive compared to other programming languages like Java or C.
You ain’t a data scientist if you don’t know ‘R’!
It is one of the most popular languages in data science and works very well with statistical computing.
HANA (High-performance ANalytic Appliance) is a software/hardware in-memory platform developed by SAP. It is designed for high-volume data transactions and real-time analytics.
Hama is a distributed computing framework for big data analytics based on the Bulk Synchronous Parallel model, intended for advanced and complex computations such as graphs, network algorithms, and matrices. It is a Top-Level Project of the Apache Software Foundation.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is the primary data storage layer used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed, Java-based file system that supplies high-performance access to data across highly scalable Hadoop clusters. It is designed to be highly fault-tolerant.
Quantum computing is the study of a model of computation based on the principles of quantum theory, which explains the nature and behavior of energy and matter at the quantum (atomic and subatomic) level. It relies on a very different form of data handling that could be called non-binary, as a quantum bit can represent more than two possible values.
Software-as-a-Service enables vendors to host an application and make it available via the internet over the cloud.
Structured Query Language (SQL)
SQL is a standard programming language that is used to retrieve and manage data in a relational database. This language is very useful to create and query relational databases.
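For a quick, hands-on taste of SQL, Python’s built-in sqlite3 module can create and query a small relational table. The table and values below are made up for illustration:

```python
# Create and query a relational table with standard SQL, using an
# in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("west", 250), ("east", 50)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)   # [('east', 150), ('west', 250)]
conn.close()
```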
Apache Thrift
It is a software framework used for the development of scalable cross-language services. It combines a code generation engine with a software stack to build services that work seamlessly and efficiently between different programming languages such as Ruby, Java, PHP, C++, Python, and C#.
Tableau
A commercial data visualization package often used in data science projects.
TensorFlow is a free software library focused on machine learning, created by Google and released under the Apache 2.0 open-source license. It was originally developed by engineers and researchers on the Google Brain team, mainly for internal use.
Vector Markup Language (VML)
Vector Markup Language (VML) refers to an application of XML 1.0 that defines the encoding of vector graphics in HTML.
XML-Query Language & Database
It refers to a specific query and programming language for processing XML documents and data. XML Databases allow data to be stored in XML format.
Apache ZooKeeper
It is an Apache software project and Hadoop subproject that provides an open-source, centralized coordination service – configuration management, naming, and synchronization – for distributed systems. It supports the consolidated organization of large distributed systems.
Zend Optimizer refers to an open-source runtime application used with file scripts encoded by Zend Encoder and Zend Safeguard to boost the overall PHP application runtime speed.
Zachman Framework refers to a visual aid for organizing ideas about enterprise technology.
Data about Data – It’s Data Everywhere!
Automatic Identification and Capture (AIDC): a broad set of technologies used to glean data from an object, image, or sound without manual intervention.
ACID Test: stands for atomicity, consistency, isolation, and durability of data.
Aggregation: a process of searching, gathering, analyzing and presenting the data.
Batch Processing: a standard computing strategy that involves processing data in large sets. Hadoop uses batch processing.
Business Intelligence (BI): infrastructure and tools, applications, and best practices that enable access to and analysis of data to improve and optimize decisions and performance.
Biometrics: the process of using analytics and technology to identify people by one or more of their physical traits – face recognition, iris recognition, fingerprint recognition, etc. It is commonly used in modern smartphones.
Blob storage: an Azure service that stores unstructured data in the cloud as an object or as a blob.
CoAP (Constrained Application Protocol): an Internet application protocol for resource-constrained devices that can be translated to HTTP if needed.
Data Analyst: he or she deals with collecting, manipulating and analyzing data in addition to preparing reports.
Dark Data: all the data that is gathered and stored by enterprises but may never be analyzed or used for any meaningful purpose, remaining ‘dark’. It can include social network feeds, call center logs, and meeting notes.
Data Warehouse: They are repositories for enterprise-wide data – but in a structured format after cleaning and integrating with other sources.
Device Layer: the entire range of sensors, actuators, smartphones, gateways, and industrial equipment for sending data streams across the enterprise depending on their performance characteristics and environment where these data sets work.
Data lake: a large repository of enterprise-wide data in the ‘raw format’ and you must really know what you are looking for and how to process it as these data sets are not in a structured format.
Data Mining: it is about finding meaningful patterns and deriving insights in large sets of data using sophisticated pattern recognition techniques – statistics, machine learning algorithms, and artificial intelligence.
Data Modelling: is defined as the analysis of data objects using data modeling techniques to create insights from the data.
Structured v Unstructured Data: structured data can be put into relational databases and organized in such a way that it relates to other data via tables. Unstructured data – email messages, social media posts, recorded human speech – can’t be organized that easily.
Data Cleansing: deals with detecting and correcting or removing inaccurate data or records from a database. Using a combination of manual and automated tools and algorithms, data analysts can correct and enrich data to improve its quality.
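The cleansing steps just described can be sketched on a few hypothetical records: trimming whitespace, normalising case, and dropping duplicates and rows with missing values:

```python
# A minimal data-cleansing sketch: normalise, deduplicate, and drop
# incomplete records (sample data is hypothetical).
records = [
    {"email": " A@X.COM ", "age": "34"},
    {"email": "a@x.com", "age": "34"},    # duplicate after cleaning
    {"email": "b@x.com", "age": ""},      # missing value, dropped
]

seen, cleaned = set(), []
for rec in records:
    email = rec["email"].strip().lower()
    if not rec["age"] or email in seen:
        continue
    seen.add(email)
    cleaned.append({"email": email, "age": int(rec["age"])})

print(cleaned)
```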
DaaS: stands for Data-as-a-Service. It helps customers get high-quality data quickly by giving them on-demand access to cloud-hosted data.
Data Virtualization: Did you know our photos are stored on social networks using data virtualization? This is an approach to data management that allows an application to retrieve and manipulate data without needing the technical details of where the data is stored or how it is formatted.
Dirty Data: data that is not clean or in other words inaccurate, duplicated and inconsistent data.
Data Engineering: the collection, storage, and processing of data so that it can be queried by a data scientist.
Data Flow Management: the process of ingesting raw device data, while managing the flow of thousands of producers and consumers. Then performing basic data enrichment, analysis in stream, aggregation, splitting, schema translation, format conversion, and other initial steps to prepare the data for further business processing.
Data Governance: the process of managing the availability, usability, integrity, and security of data within a data lake.
Data Integration: the process of integrating data from different sources, networks, systems and providing a unified view for the user.
Data Operationalization: the capabilities that can consistently sift through your large volumes of data, strictly define variables into measurable factors, and deliver actionable information to business stakeholders to drive better outcomes. It defines fuzzy concepts and allows them to be measured, empirically and quantitatively.
Data Preparation: the process of collecting, cleaning, and consolidating data into one file or data table during the data analysis process.
Data Processing: the process of retrieving, transforming, analyzing, or classifying data by a machine.
Data Science: a multi-disciplinary field that uses scientific methods, repeatable processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data Scientist: Let’s talk about a career that is HOT! A data scientist is someone who can make sense of big data by extracting raw data, massaging it, and coming up with insights. Some of the skills required of data scientists are analytics, statistics, computer science, creativity, storytelling, and an understanding of business context.
Data Swamp: As companies collect increasing amounts of data and store it, sometimes, what started as a data lake turns into a data swamp without proper governance.
Data Validation: The act of examining data sets to ensure that all data is clean, correct, and useful before it is processed.
Data Custodian: in data governance groups, a person who has administrative and/or operational responsibility for data.
Data feed: a stream of data such as a Twitter feed or RSS.
Data measurement (memory or data storage):
Terabyte: a relatively large unit of digital data; one terabyte (TB) equals 1,024 gigabytes.
Petabyte: equals 1,024 terabytes, or about a million gigabytes.
Exabyte: equals 1,024 petabytes and precedes the zettabyte in the scale of units.
Zettabyte: approximately 1,024 exabytes, or a billion terabytes.
Yottabyte: approximately 1,024 zettabytes, or around 250 trillion DVDs.
Brontobyte: approximately 1,024 yottabytes.
Econometrics: the application of statistical and mathematical theories in economics for testing hypotheses and forecasting future trends. Econometrics makes use of economic models, tests them through statistical trials and then compares the results against real-life examples. It can be subdivided into two major categories: theoretical and applied.
Fuzzy logic: a kind of computing meant to mimic human reasoning by working with partial truths, as opposed to the absolute truths (‘0’ and ‘1’) of ordinary Boolean algebra. Heavily used in natural language processing, fuzzy logic has made its way into other data-related disciplines as well.
Failover: the capabilities to automatically and seamlessly switch to a highly reliable backup upon the failure of a primary server, application, or system.
Fault-tolerant design: a system designed to continue working even when certain parts fail.
Feature Engineering: a ‘feature’ is the machine learning term for a piece of measurable information, such as the height, length, or breadth of a solid object. Feature engineering – creating, selecting, and reducing such features – is one of the most effective ways to improve predictive models.
Frequentist Statistics: a procedure to test the probability or improbability of a hypothesis. The sampling distributions of fixed size are taken. Then, the experiment is theoretically repeated an infinite number of times under the same conditions but practically done with a stopping intention.
Gamification: the application of game elements and digital game design techniques in non-game contexts. It can be used to interact with customer, change the behavior of consumers and improve marketing efforts that lead to increased sales. Or for the internal crowdsourcing activities and employee productivity.
Graph Databases: Ever wondered how Amazon tells you what other products people bought when you are trying to buy a product? Yup, Graph database! It uses concepts such as nodes and edges representing people/businesses and their interrelationships to mine data from social media.
Imputation: the technique used for handling missing values in the data using statistical metrics like mean/mode imputation or by machine learning techniques like kNN imputation.
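Mean imputation, the simplest of these techniques, can be sketched in a few lines (the column values are hypothetical):

```python
# Mean imputation: replace missing values (None) in a numeric column
# with the mean of the observed values.
values = [4.0, None, 6.0, None, 8.0]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
imputed = [mean if v is None else v for v in values]
print(imputed)
```

kNN imputation would instead fill each gap using the values of the most similar complete records.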
Inferential Statistics: refers to mathematical methods that employ probability theory. This method is used to deduce the properties of a population from the data sample collected. Inferential Statistics also deals with the precision and reliability of the inferences it helps to draw.
IQR: the interquartile range, a measure of variability based on dividing a rank-ordered data set into quartiles, i.e., four equal parts whose boundaries are denoted Q1, Q2, and Q3. The IQR is the difference between the third and first quartiles (Q3 − Q1).
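Python’s statistics module can compute the quartile cut points directly (the sample below is made up; note that quantiles uses the “exclusive” method by default, so other tools may give slightly different Q1/Q3):

```python
# Quartiles and interquartile range of a small rank-ordered sample.
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15]
q1, q2, q3 = statistics.quantiles(data, n=4)  # cut points for 4 equal parts
iqr = q3 - q1
print(q1, q2, q3, iqr)
```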
Juridical Data Compliance: commonly used in the context of cloud-based solutions, where the data is stored in a different country or continent. Data storage in a server or data center located in a foreign country must abide by the data security laws of the nation.
Load balancing: distributing workloads across multiple computers or servers in order to achieve optimal results and better utilization of the system.
Metadata: basic information about a document, an image, a video, a spreadsheet or a web page, which makes working with particular instances of data easier. For example, author, date created, date modified and file size are metadata of a document.
Stream Processing: processing real-time and streaming data with “continuous” queries, often combined with streaming analytics.
Multi-Dimensional Databases: a database optimized for data OLAP applications and for data warehousing. For example, ‘MultiValue Database’ is a type of NoSQL and multidimensional database that understands 3-dimensional data directly.
Neural Network: a biologically-inspired programming paradigm that enables a computer to learn from observational data. Deep learning is a powerful set of techniques for learning in neural networks.
Pattern Recognition: it is closely linked and even considered synonymous with machine learning and data mining. It is a result that occurs when an algorithm locates recurrences or regularities within large data sets or across disparate data sets. Such inference helps researchers discover insights or reach conclusions that would otherwise be obscured.
RFID: a type of sensor using wireless non-contact radio-frequency electromagnetic fields to transfer data. With IoT revolution, RFID tags can be embedded into every possible ‘thing’ to generate a massive amount of data that needs to be analyzed.
Semi-structured Data: refers to data that is not captured or formatted in conventional ways, such as those associated with traditional database fields or common data models. It is also not raw or totally unstructured and may contain some data tables, tags or other structural elements. Examples of semi-structured data are Graphs and tables, XML documents.
Smart Data: data that is useful and actionable after some filtering done by algorithms.
Visualization: complex graphs that can include many variables of data while still remaining understandable and readable. Raw data can be put to use with the right visualizations.
KeyValue Databases: also known as a key-value store. It is the most flexible type of NoSQL database, with no schema; the stored value is opaque to the database and is identified and accessed via a key. The value can be a number, string, counter, JSON, XML, HTML, binary, an image, or a short video.
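To make the idea concrete, here is a toy in-memory key-value store sketched in Python (real systems such as Redis or DynamoDB add persistence, replication, and networking on top of this same access pattern):

```python
# A toy in-memory key-value store: the value is opaque to the store
# and is identified and accessed only via its key.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value may be a number, string, JSON blob, bytes, etc.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42:name", "Ada")
store.put("page:home:hits", 1024)
```

Note that there is no schema: two keys can hold values of entirely different types, and the store never inspects what it holds.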
K-Means: a type of unsupervised learning that is used for segregating unlabeled data. This algorithm works iteratively to assign each data point to one of K-groups based on the key features.
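The assign-then-update iteration can be sketched for one-dimensional data in plain Python (a minimal illustration, not a production implementation):

```python
# A minimal 1-D K-Means sketch: iteratively assign points to the nearest
# of K centroids, then move each centroid to the mean of its group.
def k_means_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's group.
        groups = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.4, 8.6]  # two obvious clusters
centers = k_means_1d(data, centroids=[0.0, 10.0])
```

Starting from centroids 0.0 and 10.0, the algorithm converges on the two cluster means (about 1.0 and 9.0).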
K-nearest Neighbors: a simple algorithm that stores all available cases and classifies new cases based on a similarity measure. It was used in statistical estimation and pattern recognition during the 1970s as a non-parametric technique.
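A minimal sketch of the idea, using one-dimensional distance as the similarity measure (the training data is illustrative):

```python
from collections import Counter

# A minimal k-nearest-neighbors classifier: a new case takes the majority
# label among the k stored cases most similar to it.
def knn_classify(examples, query, k=3):
    # examples: list of (value, label) pairs kept in memory ("lazy" learning)
    nearest = sorted(examples, key=lambda e: abs(e[0] - query))[:k]
    labels = Counter(label for _, label in nearest)
    return labels.most_common(1)[0][0]

training = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
            (8.0, "high"), (8.5, "high"), (9.0, "high")]
label = knn_classify(training, query=1.8)  # nearest neighbors are all "low"
```

Because all cases are simply stored and compared at query time, there is no training step, which is why KNN is called a non-parametric, “lazy” learner.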
Latency: delays in transmitting or processing data, such as network latency or disk latency.
Linear Regression: a kind of statistical analysis or using a predictive model, that looks at various data points, shows a relationship between two variables, and plots a trend line. For example, it can be used in showing trends in cancer diagnosis data or in stock prices.
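The trend line can be fitted by ordinary least squares; a minimal sketch with made-up data points:

```python
# A minimal least-squares linear regression: fit y = slope * x + intercept
# to a set of (x, y) points, giving a trend line for prediction.
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([(1, 3), (2, 5), (3, 7), (4, 9)])
```

With real, noisy data the fitted line minimizes the squared vertical distances to the points rather than passing through them exactly.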
Logistic Regression: refers to a statistical method for analyzing a dataset in which one or more independent variables determine a (typically binary) outcome.
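The core of a logistic-regression prediction is squashing a linear combination of inputs through the sigmoid function into a probability; a sketch with illustrative (not fitted) weights:

```python
import math

# A minimal logistic-regression prediction: a linear score is mapped
# through the sigmoid into a probability between 0 and 1.
# The weight and bias here are illustrative, not fitted to real data.
def predict_probability(x, weight, bias):
    z = weight * x + bias
    return 1.0 / (1.0 + math.exp(-z))

p_low = predict_probability(-5.0, weight=1.0, bias=0.0)   # near 0
p_mid = predict_probability(0.0, weight=1.0, bias=0.0)    # exactly 0.5
p_high = predict_probability(5.0, weight=1.0, bias=0.0)   # near 1
```

In practice the weight and bias are learned from labeled data, usually by maximum-likelihood estimation.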
Location Data: information collected by a network or service about where a user’s phone or device is located.
Logfile: a file that maintains a registry of events, processes, messages, and communication between various communicating software applications and the operating system.
Logarithm: refers to an exponent used in mathematical calculations to depict the perceived levels of variable quantities such as visible light energy, electromagnetic field strength, and sound intensity.
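For example, sound intensity is usually expressed on a logarithmic decibel scale; a small sketch (the reference value is illustrative):

```python
import math

# Logarithms compress wide-ranging quantities into manageable scales.
# Example: sound intensity relative to a reference, expressed in decibels.
def to_decibels(intensity, reference=1.0):
    return 10 * math.log10(intensity / reference)

db = to_decibels(1000.0)  # a sound 1000x the reference intensity -> 30 dB
```

A thousand-fold increase in intensity thus shows up as a modest additive step on the logarithmic scale, which is why such scales suit quantities spanning many orders of magnitude.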
Mashup: a method of merging different datasets into a single application and is used for visualization. For instance, combining job listings with demographic data.
Munging: the process of manually converting or mapping data from one raw form into another format for more convenient consumption.
Normalization: the process of reorganizing data in a database to meet two basic requirements: (1) there is no redundancy of data, and (2) data dependencies are logical.
Parse: the division of data, such as a string, into smaller parts for analysis.
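For instance, a delimited string can be parsed into named fields (the log-line format here is hypothetical):

```python
# Parsing divides a string into smaller parts for analysis.
# A hypothetical comma-separated log line split into named fields:
line = "2017-03-14,user42,login,OK"
date, user, action, status = line.split(",")
fields = {"date": date, "user": user, "action": action, "status": status}
```

Once parsed into fields, each part can be validated, stored, or analyzed on its own.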
Persistent storage: a non-volatile place, such as a disk, where data is saved and survives after the process that created it has ended.
Quad-Core Processor: a chip containing four independent processing cores, a multi-core architecture designed to provide faster processing power.
Query: a request for data or information from a database table or combination of tables in the form of results returned by SQL or as pictorials, graphs, complex results, trend analyses from data-mining tools.
Quick Response Code: a type of two-dimensional barcode that consists of square black modules on a white background, designed primarily to be read by smartphones.
Serialization: the standard procedures for converting data structure or object state into standard storage formats.
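A common example is serializing an in-memory structure to JSON and back:

```python
import json

# Serialization converts an in-memory data structure into a standard
# storage/transmission format (here JSON), and deserialization reverses it.
record = {"id": 7, "name": "sensor-a", "readings": [21.5, 22.0, 21.8]}
encoded = json.dumps(record)   # object state -> JSON text
decoded = json.loads(encoded)  # JSON text -> object state
```

Other common serialization formats include XML, Avro, and Protocol Buffers; the choice trades off readability, size, and schema support.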
Tag: a piece of information that describes the data or content that it is assigned to. Tags are nonhierarchical keywords used for Internet bookmarks, digital images, videos, files and so on.
Taxonomy: the classification of data according to a pre-determined system with the resulting catalog. It provides a conceptual framework for easy access and retrieval.
Transactional Data: data captured from an organization’s transactions; it can change unpredictably. Examples include product sales or purchases, accounts payable or receivable, and product shipments.
Vector Markup Language (VML): an application of XML 1.0 that defines the encoding of vector graphics in HTML.
Vertical Scalability: the addition of resources to a single system node, such as a single computer or network station, which often results in additional CPUs or memory.
Web Application Security: the process of protecting confidential data stored online from unauthorized access and modification.
Whiteboarding: the manipulation of digital data files on a visual digital whiteboard.
Xanadu: a hypertext/hypermedia project.
X Server: a server program that connects X terminals running on the X Window System, whether locally or in a distributed network.
X Client: an application program that is displayed by an X server. Apache, OpenOffice, gFTP, gedit, GIMP, Xpdf, and rCalc are typical X clients when run under the X Window System.
X Terminal: an input terminal with a display, keyboard, mouse, and touchpad that uses X server software to render images.
YMODEM: refers to an asynchronous communication protocol for modems using batch file transfers.
Yoda Condition: refers to a comparison written with its two parts swapped, placing the constant before the variable (for example, `if 42 == x` instead of `if x == 42`).
Yoyo Mode: a situation wherein a computer or similar device seems to alternate rapidly between working and failing states.
Zachman Framework: a visual aid for organizing ideas about enterprise technology.
Zend Optimizer: a freely available runtime application used with file scripts encoded by Zend Encoder and Zend SafeGuard to boost overall PHP application runtime speed.
Drill: an open-source distributed system that helps in performing interactive analysis over large-scale datasets.
Pentaho: provides a suite of open-source BI (Business Intelligence) products known as Pentaho Business Analytics that helps in OLAP services, data integration, dashboards, reporting, ETL capabilities, and data mining.
GPU-accelerated Databases: databases that use graphics processing units (GPUs) to accelerate queries, often employed to ingest and analyze streaming data.
Ingestion: the intake of streaming data from any number of different sources.
Normal Distribution: also known as Gaussian distribution or bell curve – a common graph representing the probability of a large number of random variables, where those variables approach normalcy as the data set increases in size.
Shard: an individual partition of a database.
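Keys are commonly routed to shards by hashing, so the same key always lands on the same partition; a minimal sketch (the key format and shard count are illustrative):

```python
import zlib

# Hash-based sharding sketch: each key is mapped deterministically to one
# of N partitions, so the same key always lands on the same shard.
NUM_SHARDS = 4

def shard_for(key):
    # zlib.crc32 gives a stable hash across runs (unlike the built-in hash()).
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

s1 = shard_for("customer:1001")
s2 = shard_for("customer:1001")  # same key, same shard every time
```

Production systems often use consistent hashing instead, so that adding or removing a shard moves only a fraction of the keys.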
Telemetry: the remote acquisition of information about an object. For example, collecting data from an automobile, smartphone, medical device, or IoT device.
Complex Event Processing (CEP): the process of analyzing and combining data from multiple sources to identify events or patterns that suggest responses to complex circumstances.
Linked Data: a collection of interconnected datasets that can be shared or published on the web and consumed by both machines and users. It is highly structured, unlike big data. It is used in building the Semantic Web, in which a large amount of data is available in standard formats on the web.
Online Analytical Processing (OLAP): the process by which analysis of multidimensional data is done by using three operators – drill-down, consolidation, and slice and dice.
Online Transactional Processing (OLTP): the process that provides users with access to large sets of transactional data, typically for fast, routine operations such as order entry or banking transactions.
Operational Data Store (ODS): a location to collect and store data retrieved from various sources.
Parallel Method Invocation (PMI): a system that allows program code to call or invoke multiple methods/functions simultaneously.
Recommendation Engine: an algorithm that analyzes a customer’s actions and purchases on an e-commerce website in order to suggest relevant products.
Correlation Analysis: the analysis of data to determine the relationship between variables and whether that relationship is negative (down to -1.00) or positive (up to +1.00).
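The Pearson correlation coefficient is the usual measure; a minimal sketch with illustrative data:

```python
# Pearson correlation: +1.00 is a perfect positive relationship,
# -1.00 a perfect negative one, and 0 means no linear relationship.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r_pos = pearson([1, 2, 3, 4], [10, 20, 30, 40])  # perfectly positive
r_neg = pearson([1, 2, 3, 4], [40, 30, 20, 10])  # perfectly negative
```

Remember that correlation only measures linear association; a strong correlation does not by itself establish causation.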
Histogram: a graphical representation of the distribution of a set of numeric data, usually a vertical bar graph.
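A quick text-based sketch of the idea, bucketing values and drawing one bar per bucket (the data is illustrative):

```python
from collections import Counter

# A text histogram: count how often each value occurs, then draw a bar
# per bucket (horizontal bars of '#' for simplicity).
values = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
counts = Counter(values)
bars = {bucket: "#" * count for bucket, count in sorted(counts.items())}
# bars -> {1: '#', 2: '##', 3: '###', 4: '####'}
```

For continuous data, values are first grouped into ranges (bins) before counting, which is what charting libraries do under the hood.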
5 V’s of Big Data: the five dimensions commonly used to characterize big data, namely Volume, Velocity, Variety, Veracity, and Value.
Some of these terms have been in use for a long time, but they gained all the more attention due to ‘Big Data’. This glossary can come in handy whenever you work on Big Data.
Do let me know in the comments if I need to capture any more relevant terms and definitions.