50 Data Engineering Jargon #20
The world of Data Engineering has evolved significantly over the past decade, growing from a once humble data warehousing set-up into an entire discipline. It now includes everything from building data pipelines to building data lakes.
The popularity of this field has increased, as has the jargon.
Here, I’ll cover 50 terms and break down their definitions with examples.
Let’s get started:
1. Data Dump
A file or a table containing a significant amount of data to be analysed or transferred.
A table containing the “data dump” of all customer addresses.
2. Data Pipelines
A data processing method akin to a pipeline: data is ingested at one end, processed along the way and delivered to a destination at the other end.
A pipeline where customer address data is ingested from source A, aggregated by city, and the resulting information is loaded into destination B.
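The address example above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the function names `ingest`, `aggregate_by_city` and `load`, the sample rows and the `destination_b` dictionary are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample rows standing in for "source A".
SOURCE_A = [
    {"customer": "Asha", "city": "London"},
    {"customer": "Ben", "city": "Leeds"},
    {"customer": "Chloe", "city": "London"},
]


def ingest(source):
    """Step 1: read the raw customer address rows from the source."""
    return list(source)


def aggregate_by_city(rows):
    """Step 2: count how many customers live in each city."""
    return Counter(row["city"] for row in rows)


def load(counts, destination):
    """Step 3: write the aggregated result into the destination."""
    destination.update(counts)
    return destination


destination_b = {}  # stands in for "destination B"
load(aggregate_by_city(ingest(SOURCE_A)), destination_b)
print(destination_b)
```

Each step is a separate function, so any one of them could be swapped out (a different source, a different aggregation) without touching the others.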
3. Database Administrator (DBA)
A Database Administrator is an admin role held by someone who understands a particular database technology and how to get the best out of it. This includes improving performance, backups and recovery.
Performance tuning the database to respond better to particular complex data queries.
4. Data Warehouse
A method of organising data to make it easy to analyse and report on, in order to support business decisions.
Oracle data warehouse. Organising customer data in a data warehouse to be able to report the number of newly acquired customers.
5. Data Mart
A subset of a data warehouse, created for a very specific business use case.
Finance data mart storing all the relevant financial information required by the Accounting team to process their month-end cycles.
6. Operational Data Store (ODS)
An operational data store generally holds limited, current information to support simple queries. It is not designed to handle historical or complex data queries.
An ODS for daily stock fluctuations in a warehouse to help the warehouse manager decide what to prioritise in the next order delivery.
7. Enterprise Data Warehouse (EDW)
The same as a data warehouse except it includes all the data within an organisation. This means that the entire enterprise can rely on this warehouse for their business decisions.
Organising sales, customer, marketing and finance data in an enterprise data warehouse to be able to create several key management reports.
8. RDBMS
Relational database management system. The examples above are typically built on an RDBMS, meaning they store data in a structured format using rows and columns.
A Microsoft SQL Server database.
9. In-memory DB
Traditional databases have been used for complex calculations and queries. They store information on the actual disk in the computer. An in-memory DB stores all of its information in memory (RAM), which allows for rapid calculations without having to read from and write to a normal disk.
A drill-down functionality of a live dashboard.
10. Data Lake
A repository for all kinds of structured and unstructured data, often built on Hadoop storage technology. It is called a lake as it is flexible enough to store anything from raw data to unstructured email files.
Hadoop Data Lake. Storing logs of all customers who called the inbound call centre, including call duration.
11. Data Ingestion
Generally, the first step in a data pipeline where data is ingested into the platform.
A pipeline where customer address data is ingested from source A.
12. Extract, Transform, Load (ETL)
A 3-step process of extracting data, transforming it (by applying some kind of logic like aggregation) and loading the new information into the destination. A variant is ELT (extract, load, transform), where the raw data is loaded first and then transformed inside the destination.
An extract of customer address data is taken from the customer relationship management tool, aggregated by city, and the resulting information is loaded into destination B.
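To show the ELT variant mentioned above, here is a small sketch using Python's built-in `sqlite3` module, with an in-memory SQLite database standing in for destination B. The table names `raw_customer_address` and `customers_per_city` and the sample rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical extract from a CRM tool: (customer_name, customer_city).
crm_extract = [("Asha", "London"), ("Ben", "Leeds"), ("Chloe", "London")]

conn = sqlite3.connect(":memory:")  # stands in for destination B

# Load: copy the raw extract into the destination untouched.
conn.execute(
    "CREATE TABLE raw_customer_address (customer_name TEXT, customer_city TEXT)"
)
conn.executemany("INSERT INTO raw_customer_address VALUES (?, ?)", crm_extract)

# Transform: the aggregation runs inside the destination database.
conn.execute(
    """
    CREATE TABLE customers_per_city AS
    SELECT customer_city, COUNT(*) AS customer_count
    FROM raw_customer_address
    GROUP BY customer_city
    """
)

customers_per_city = conn.execute(
    "SELECT customer_city, customer_count "
    "FROM customers_per_city ORDER BY customer_city"
).fetchall()
print(customers_per_city)
```

In a classic ETL flow the `GROUP BY` step would instead run before the data reaches the destination, and only the aggregated rows would be loaded.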
13. Data Models
A way of organising data so that it reflects, and can be understood in terms of, a real-world scenario.
Taking a huge amount of data and logically grouping it into customer, product and location data.
14. Normalisation
A method of organising the data in a granular enough format that it can be utilised for different purposes over time. Usually, this is done by normalising the data into different forms such as 1NF (first normal form) or 3NF (third normal form), which is the most common.
Taking customer order data and creating a granular information model: the order in one table, the items ordered in another table, the customer contact in another table and the payment of the order in another table. This allows for the data to be re-used for different purposes over time.
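As a rough sketch of that idea, the snippet below splits a flat list of order records into separate customer, order and order-item structures, so each fact is stored only once. The field names and sample data are made up for the example, and plain Python dictionaries stand in for database tables.

```python
# Hypothetical flat order records: customer details repeat on every row.
flat_orders = [
    {"order_id": 1, "customer": "Asha", "email": "asha@example.com", "item": "Kettle"},
    {"order_id": 1, "customer": "Asha", "email": "asha@example.com", "item": "Toaster"},
    {"order_id": 2, "customer": "Ben", "email": "ben@example.com", "item": "Kettle"},
]

customers = {}    # customer name -> contact details (stored once per customer)
orders = {}       # order_id -> customer name (stored once per order)
order_items = []  # one row per item ordered

for row in flat_orders:
    customers[row["customer"]] = {"email": row["email"]}
    orders[row["order_id"]] = row["customer"]
    order_items.append((row["order_id"], row["item"]))

print(customers)
print(orders)
print(order_items)
```

If Asha's email changes, it now only needs updating in one place rather than on every order row.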
15. Star schema
The simplest way to model data, splitting it into quantitative and qualitative data called facts and dimensions respectively. Usually, a central fact table is interpreted with the help of the dimension tables that surround it, so the layout resembles a star.
A Star schema of sales data with dimensions such as customer, product & time.
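A tiny version of such a star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_customer`, `dim_product`) and the sample rows are assumptions for illustration, and the time dimension from the example above is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables hold the qualitative, descriptive data.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name TEXT);
    -- The fact table holds the quantitative data plus keys to the dimensions.
    CREATE TABLE fact_sales   (customer_id INTEGER, product_id INTEGER, quantity INTEGER);

    INSERT INTO dim_customer VALUES (1, 'Asha', 'GB'), (2, 'Ben', 'FR');
    INSERT INTO dim_product  VALUES (10, 'Kettle'), (20, 'Toaster');
    INSERT INTO fact_sales   VALUES (1, 10, 2), (1, 20, 1), (2, 10, 5);
    """
)

# Interpret the facts with the help of a dimension: units sold per country.
units_by_country = conn.execute(
    """
    SELECT c.country, SUM(f.quantity) AS units_sold
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
print(units_by_country)
```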
16. Fact
A data warehousing term for quantitative information.
The number of orders placed by a customer.
17. Dimension
A data warehousing term for qualitative information.
Name of the customer or their country of residence.
18. Schema
A term for a collection of database objects. These are generally used to logically separate data within the database and apply access controls.
Storing HR data in HR schema allows the logical segregation from other data in the organisation.
19. SCD Type 1 – 6
Slowly changing dimensions (SCD) are a method of dealing with changes in the data over time in a data warehouse. With Type 1, history is overwritten, whereas with Type 2 (the most common), history is maintained each time a change occurs.
When a customer changes their address, SCD Type 1 would overwrite the old address with the new one, whereas Type 2 would store both addresses to maintain history.
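The difference between the two types can be sketched as follows. The helper functions `apply_scd_type1` and `apply_scd_type2` and the `valid_from`, `valid_to` and `is_current` columns are assumptions for this example; they follow one common way of tracking Type 2 history, not the only one.

```python
from datetime import date


def apply_scd_type1(row, new_address):
    """Type 1: overwrite the address, so the old value is lost."""
    row["address"] = new_address
    return row


def apply_scd_type2(history, new_address, changed_on):
    """Type 2: close the current row and append a new current row."""
    current = history[-1]
    current["valid_to"] = changed_on
    current["is_current"] = False
    history.append(
        {
            "customer_id": current["customer_id"],
            "address": new_address,
            "valid_from": changed_on,
            "valid_to": None,
            "is_current": True,
        }
    )
    return history


type1_row = apply_scd_type1({"customer_id": 1, "address": "1 Old Road"}, "2 New Street")

type2_history = apply_scd_type2(
    [
        {
            "customer_id": 1,
            "address": "1 Old Road",
            "valid_from": date(2020, 1, 1),
            "valid_to": None,
            "is_current": True,
        }
    ],
    "2 New Street",
    date(2023, 6, 1),
)
print(type1_row)
print(len(type2_history))
```

After the change, the Type 1 row only knows the new address, while the Type 2 history still contains the old address alongside the dates it was valid for.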
20. Business Intelligence
A slightly out of date term for a combination of practices to derive business insights from data by predominantly using data warehousing, analytics and dashboarding.
Creating a management dashboard to show customer demographics across the country.
21. Batch Processing
An automated way of processing millions of data transactions together in a single run. This is generally carried out overnight with the help of “batch jobs”.
Loading the data of all customers who bought a particular item on the day.
22. SQL / T-SQL
SQL is Structured Query Language or, simply put, a language used to manage databases. T-SQL is Transact-SQL, which is a proprietary Microsoft extension of the SQL language.
T-SQL can be used with MS SQL Server or Azure SQL Database to write a statement such as “SELECT customer_name FROM tbl_customer_information WHERE customer_city = 'London'”. This returns the names of all customers based in London.
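To try the statement above without a SQL Server instance, here is a small sketch that runs it against an in-memory SQLite database via Python's built-in `sqlite3` module. Note that this is SQLite rather than T-SQL; this particular query is plain SQL that should behave the same way in both. The sample rows are made up, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_customer_information (customer_name TEXT, customer_city TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_customer_information VALUES (?, ?)",
    [("Asha", "London"), ("Ben", "Leeds"), ("Chloe", "London")],
)

london_customers = conn.execute(
    "SELECT customer_name FROM tbl_customer_information "
    "WHERE customer_city = 'London' ORDER BY customer_name"
).fetchall()
print(london_customers)
```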
23. NoSQL
Although SQL has been around for decades, NoSQL (not only SQL) is a newer concept designed for non-relational databases, particularly to store unstructured data like documents.
Storing the contents of an Outlook email as a document of key-value pairs in a MongoDB document database.
24. BTEQ
Basic Teradata Query (BTEQ) is simply a utility and SQL-like query tool for Teradata, which is a relational database system.
Creating a BTEQ script to load data from a flat-file.
25. Cloud Computing
Delivery of computing services such as servers, networking, analytics etc. over the internet instead of using a dedicated data centre for an organisation.
Storing data on Microsoft’s Azure Cloud service instead of on an on-premise solution.
26. Data Architecture
The discipline of managing the people, process and technologies relating to data; this includes data strategy, data capture processes and technical patterns to derive insight from the data.
A Data Architect creates a framework for an enterprise to manage its data flow end to end.
27. Data Visualisation
A practice for visualising large amounts of data to derive key insights to drive business decisions.
An executive dashboard that clearly outlines the sales performance of a certain team.
28. Data Centres
A dedicated space (nowadays millions of sq ft) which houses servers and systems for the organisation’s key applications.
Microsoft Data Centre to host all the company’s key applications.
29. Data Integration
Usually, the hardest part of the project, where multiple sources of data are integrated into a singular application/data warehouse.
Integrating finance and customer relationship systems into an MS SQL Server database.
30. Data Migration
The practice of migrating data from a source to a destination.
Migrating data from an MS SQL Server database to Amazon Relational Database Service (RDS).
31. Data Replication
There are multiple ways to do this, but mainly it is a practice of replicating data to multiple servers to protect an organisation against data loss.
Replicating the customer information across two databases, to make sure their core details are not lost.
32. Big Data
A term coined for large amounts of data that cannot be processed using traditional databases. Refer to Data Lake above.
Hadoop Data Lake to store all the information received from sensors in a smart fridge.
33. Apache Hive
Apache Hive is an open-source data warehouse project which allows querying of large amounts of data. It uses an easy-to-understand, SQL-like language called Hive QL.
“SELECT * FROM tbl;” returns all rows and columns of a table whose data sits in a data store like HDFS.
34. HDFS
Hadoop Distributed File System is a data storage system used by Hadoop. It provides flexibility to manage structured or unstructured data.
Storing large amounts of financial transactional data in an HDFS to query using Hive QL.
35. Apache NiFi
Apache NiFi is an open-source extract, transform and load tool (refer to ETL) that allows filtering, integrating and joining data.
Moving postcode data from a .csv file to HDFS using NiFi.
36. Apache Kafka
Apache Kafka is a messaging system first created by LinkedIn engineers, mainly used for real-time streaming data. It is more complex to work with than NiFi as it doesn’t have a user interface (UI).
Streaming real-time weather events using Kafka.
37. Flat File
Commonly used to transfer data due to their basic nature; flat files are a single table storing data in a plain text format.
All customer order numbers stored in a comma-separated values (.csv) file.
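Reading such a flat file takes only a few lines with Python's built-in `csv` module. In this sketch the file contents are held in an `io.StringIO` object so the example is self-contained; the column names and order numbers are made up for illustration.

```python
import csv
import io

# Hypothetical contents of a customer orders .csv flat file.
flat_file = io.StringIO("order_number,customer_name\n1001,Asha\n1002,Ben\n")

# DictReader uses the first line as column names for every following row.
reader = csv.DictReader(flat_file)
order_numbers = [row["order_number"] for row in reader]
print(order_numbers)
```

Everything in a flat file is plain text, so the order numbers come back as strings and would need converting if they are to be treated as numbers.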
38. Latency
The time it takes for a database or a web application to respond to a query or a click.
Takes 30 seconds to query a database with 5 million records.
39. Caching
This is when a limited amount of data is stored in RAM to allow for quick retrieval of information.
In-memory caching of data in a database returns results to a query 100 times faster.
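A simple way to see caching in action is Python's `functools.lru_cache`, which keeps recent results in memory. In this sketch `slow_database_lookup` is a made-up stand-in for an expensive database query, and the `database_calls` list only exists to show how often the "database" is really hit.

```python
from functools import lru_cache

database_calls = []  # records every time the slow lookup really runs


def slow_database_lookup(customer_id):
    """Hypothetical stand-in for an expensive database query."""
    database_calls.append(customer_id)
    return {"customer_id": customer_id, "city": "London"}


@lru_cache(maxsize=128)  # keep up to 128 recent results in RAM
def get_customer_city(customer_id):
    return slow_database_lookup(customer_id)["city"]


get_customer_city(1)  # first call goes to the "database"
get_customer_city(1)  # second call is answered from the cache
print(len(database_calls))
```

The trade-off is that cached data can go stale, which is why caches usually hold limited data and are refreshed or cleared regularly.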
40. Staging
The name of a storage area that is temporary in nature, used to allow for the processing of ETL jobs (refer to ETL).
A staging area in an ETL routine to allow for data to be cleaned before loading into the final tables.
41. Sandbox
Usually refers to an environment where extensive testing can be carried out without compromising the sanctity of the live platform.
A sandbox to prove the concept of a keyboard metric before it is accepted in a live environment.
42. Subject Area
A way of defining a data model by grouping the enterprise’s data according to known business directorates.
A customer subject area containing all customer information that can be utilised across the business.
43. Raw Data
This is the data as it has been collected in its rawest format before it is processed, cleansed and loaded.
Raw data of all customers’ orders from the day of trading.
44. Transactional Data
This is data that describes an actual event.
Order placed, a delivery arranged, or a delivery accepted.
45. Reference Data
This is data that allows the classification of other data.
Country code GB representing Great Britain.
46. Master Data
This is data that is the best representation of a particular entity in the business. This gives you a 360 view of that data entity by generally consolidating multiple data sources.
Best customer data representation from multiple sources of information.
47. Structured Data
Data that is nicely organised in a table using rows and columns, allowing the user to easily interpret the data.
Finance data in a database table, easily queryable using SQL.
48. Unstructured Data
Data that cannot be nicely organised in a tabular format, like images, PDF files etc.
An image stored on a data lake cannot be retrieved using common data query languages.
49. Data Quality
A discipline of measuring the quality of the data to improve and cleanse it.
Checking Customer data for completeness, accuracy and validity.
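A few such checks can be sketched in plain Python. The rules, the sample records, the email pattern and the `VALID_COUNTRIES` set below are all assumptions for illustration. Only completeness and validity are covered here, since accuracy can only be checked against a trusted source of truth.

```python
import re

# Hypothetical customer records to be checked.
customers = [
    {"name": "Asha", "email": "asha@example.com", "country": "GB"},
    {"name": "", "email": "ben@example.com", "country": "GB"},
    {"name": "Chloe", "email": "not-an-email", "country": "XX"},
]

VALID_COUNTRIES = {"GB", "FR", "DE"}  # hypothetical reference data
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple


def quality_issues(record):
    """Return a list describing every rule the record fails."""
    issues = []
    if not record["name"]:
        issues.append("completeness: name is missing")
    if not EMAIL_PATTERN.match(record["email"]):
        issues.append("validity: email is malformed")
    if record["country"] not in VALID_COUNTRIES:
        issues.append("validity: unknown country code")
    return issues


# Map each record's position in the list to the issues found for it.
report = {index: quality_issues(record) for index, record in enumerate(customers)}
print(report)
```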
50. Data Management
A discipline encompassing the end-to-end management of data lifecycle, including acquiring, transferring, securing and querying data.
Combination of improving the quality of data, governing the data, enriching and cleansing the data.
That was a long list. I hope you learnt the meaning of a few terms from this, and it continues to build your understanding within Data Engineering. It certainly helped me clarify my thinking.
Do you agree with what I’ve said above? What are your thoughts? Feel free to reach out to me by email if you have some feedback or if you just want to say hello!
If you’re still reading this, I hope you’ve found some value in this blog post.