Top 15 Terms Every Big Data Professional Must Know

Mark Taylor
4 min readMar 12, 2020


According to SAS, “Big data is a term that describes the large volume of data — both structured and unstructured — that inundates a business on a day-to-day basis.” The important part, however, is not so much the volume of data as what an organization does with it — it could be a great source of insights to guide better strategic decision-making. These possibilities have brought the big data industry into prominence, and attracted numerous people to the idea of taking up a job as a big data professional.

If you are looking for a big data career, here are the top 15 terms you must familiarize yourself with:

👉Algorithm

According to geeksforgeeks.org, this means “a process or set of rules to be followed in calculations or other problem-solving operations.” Therefore, an algorithm is essentially a set of rules or instructions defining, step by step, how a certain task is supposed to be executed such that the expected results are duly generated. Here, it refers to a formula or statistical process to analyze data.
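As an illustration, here is a deliberately simple algorithm, computing the arithmetic mean of a data set, written out so each step of the "set of rules" is visible:

```python
# A minimal algorithm: compute the arithmetic mean of a data set.
# The same steps, applied to the same input, always yield the same result.
def mean(values):
    total = 0.0
    count = 0
    for v in values:        # step 1: accumulate the sum and the count
        total += v
        count += 1
    if count == 0:          # step 2: guard against empty input
        raise ValueError("mean() of empty data")
    return total / count    # step 3: divide the sum by the count

print(mean([4, 8, 15, 16, 23, 42]))  # 18.0
```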

👉Analytics

In the context of the big data industry, this refers to the examination of large amounts of data with the goal of unearthing correlations, patterns, and other insights that could possibly have remained hidden otherwise. The purpose, according to TechTarget, is to “uncover information — such as hidden patterns, unknown correlations, market trends and customer preferences — that can help organizations make informed business decisions.”

👉Descriptive analytics

This refers to a preliminary stage of data processing, which aims to summarize historical data so that some useful information can be pulled out, and the data itself can possibly be prepared for further analysis. It essentially gives information about something that has already happened.
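A quick sketch of what that summarization looks like in practice, using hypothetical daily sales figures and Python's standard-library `statistics` module:

```python
import statistics

# Hypothetical daily sales figures (illustrative numbers only).
daily_sales = [120, 135, 128, 160, 150, 142, 170]

# Descriptive analytics: summarize what has already happened.
summary = {
    "count": len(daily_sales),
    "total": sum(daily_sales),
    "mean": round(statistics.mean(daily_sales), 1),
    "median": statistics.median(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
}
print(summary)
```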

👉Predictive analytics

This refers to “the use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data” (according to SAS). It goes a step beyond what descriptive analytics unearths (what has already happened) and looks to predict what is likely to happen in the future.
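A minimal sketch of the idea: fit a least-squares trend line to historical monthly sales (hypothetical numbers) and extrapolate one period ahead. Real predictive pipelines would use statistical or machine learning libraries rather than hand-rolled math:

```python
# Hypothetical monthly sales for months 0..5.
history = [100, 110, 125, 130, 145, 150]
xs = list(range(len(history)))

# Ordinary least-squares fit: y = slope * x + intercept.
n = len(history)
x_mean = sum(xs) / n
y_mean = sum(history) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) \
        / sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

# Predict the next (6th) month from the fitted trend.
next_month = slope * n + intercept
print(round(next_month, 1))
```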

👉Prescriptive analytics

Another task handled in a big data career, this suggests a course of action or a strategy by factoring in details about available resources, past and current performance, and possible situations or scenarios.
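A toy version of that "suggest a course of action" step: given a budget (the available resources) and the expected return of each candidate action (hypothetical figures drawn from past performance), recommend the affordable action with the best net payoff:

```python
# Candidate actions with illustrative cost/return figures.
actions = [
    {"name": "email_campaign", "cost": 5_000,  "expected_return": 12_000},
    {"name": "tv_spot",        "cost": 50_000, "expected_return": 80_000},
    {"name": "search_ads",     "cost": 8_000,  "expected_return": 20_000},
]
budget = 10_000

# Prescriptive step: filter by available resources, then rank by net return.
affordable = [a for a in actions if a["cost"] <= budget]
recommendation = max(affordable, key=lambda a: a["expected_return"] - a["cost"])
print(recommendation["name"])  # search_ads
```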

👉Batch processing

This refers to processing transactions in groups or batches, and does not require any user interaction once the process begins. For a big data professional dealing with large data sets, this is particularly helpful as an efficient way of processing large data volumes gathered over a period of time.
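The pattern can be sketched in a few lines: queued records accumulated over time are processed in fixed-size groups, with no user interaction once the run begins:

```python
# Split queued records into fixed-size batches.
def batches(records, size):
    for i in range(0, len(records), size):
        yield records[i:i + size]

# Ten queued transactions, processed four at a time.
transactions = list(range(1, 11))
totals = [sum(batch) for batch in batches(transactions, 4)]
print(totals)  # [10, 26, 19]
```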

👉Cassandra

Managed by the Apache Software Foundation, this is a highly scalable, highly available distributed database that helps to store and manage high-velocity structured data across multiple commodity servers while maintaining no single point of failure. It is considered one of the most efficient NoSQL databases.

👉Cloud computing

This essentially is data and/or software accessible from anywhere through the Internet, as it is hosted and running on remote servers. It helps the big data industry by facilitating storage and analysis of large amounts of data without needing onsite facilities.

👉Cluster computing

This refers to the use of pooled resources, i.e., multiple servers, to handle data. In the context of big data, a Hadoop cluster is a special type of computational cluster designed in particular to store and analyze huge amounts of unstructured data in a distributed computing environment.
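Cluster-style pooling in miniature: here a pool of local worker threads stands in for a pool of servers, each handling its own share of a word-count job. A real cluster does the same thing across machines rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

def tokens(line):
    return len(line.split())

lines = [
    "big data needs big infrastructure",
    "hadoop distributes both storage and compute",
    "each worker handles its own share",
]

# Spread the work across pooled workers, then combine the results.
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = list(pool.map(tokens, lines))

print(sum(counts))  # 17
```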

👉Dark data

This refers to the categories of data that are gathered and processed by enterprises, but are never really used for any meaningful purposes and may possibly never be analyzed. In the context of big data, this could be an untapped source of insights, and if poorly handled, could bring about legal and security issues, among others.

👉Data lake

A data lake is a large repository of enterprise-wide data stored in a raw format. It could contain structured, semi-structured, and unstructured data, which companies can use for big data analytics and strategic decision-making.

👉Data mining

This refers to the use of advanced techniques of pattern recognition on large data sets to unearth meaningful patterns and useful insights.

👉Data scientist

One of the most popular job positions for a big data professional, this refers to someone who works on raw data sets — sourced from data lakes, among others — to pull out useful insights. A highly paid position, it calls for competence in disciplines such as analytics, statistics, and computer science, along with a dose of creativity, storytelling, and an understanding of business context.

👉Distributed file system

Commonly used for big data (as it is by definition too large to store on a single system), this is a data storage system where large volumes of data are stored across multiple storage devices. This is commonly used by Hadoop, for instance.
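A toy model of the idea: split data into fixed-size blocks and assign each block to a "node" round-robin, so no single machine holds the whole file. Node names and block sizes here are purely illustrative; real systems like HDFS also replicate each block for fault tolerance:

```python
# Distribute fixed-size blocks of data across storage nodes round-robin.
def distribute(data, block_size, nodes):
    placement = {node: [] for node in nodes}
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        placement[nodes[(i // block_size) % len(nodes)]].append(block)
    return placement

layout = distribute("abcdefghij", block_size=3, nodes=["node1", "node2"])
print(layout)  # {'node1': ['abc', 'ghi'], 'node2': ['def', 'j']}
```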

👉ETL

An acronym for Extract, Transform and Load, this refers to the process whereby raw big data is extracted, cleaned and enriched (i.e. transformed) to make it fit for further use, and then loaded into the appropriate repository.
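A minimal sketch of the three stages, with a CSV string as the raw source and a plain list standing in for the target repository; real pipelines would swap in databases, schedulers, and far richer transformations:

```python
import csv
import io

# Extract: raw rows from a source (here, inline CSV text with messy data).
raw = "name,amount\nalice, 10\nbob,\ncarol, 25\n"

warehouse = []  # stand-in for the target repository
for row in csv.DictReader(io.StringIO(raw)):
    amount = row["amount"].strip()
    if not amount:                       # Transform: drop incomplete rows
        continue
    warehouse.append({                   # Load: write the clean record
        "name": row["name"].title(),     # Transform: normalize names
        "amount": int(amount),
    })

print(warehouse)
# [{'name': 'Alice', 'amount': 10}, {'name': 'Carol', 'amount': 25}]
```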

The above are just some of the many terms that you will come across and learn about in a big data career.
