Accelerating Data Engineering Pipelines
-
Duration
1 day (8 hours)
-
Language
English
-
Technologies
pandas, cuDF, Dask, NVTabular, Plotly
Workshop Details
Data engineering is the foundation of data science and lays the groundwork for analysis and modeling. In order for organizations to extract knowledge and insights from structured and unstructured data, fast access to accurate and complete datasets is critical. Working with massive amounts of data from disparate sources requires complex infrastructure and expertise. Minor inefficiencies can result in major costs, in both time and money, when scaled across millions to trillions of data points.
In this workshop, we’ll explore how GPUs can improve data pipelines and how advanced data engineering tools and techniques can deliver significant performance gains. Faster pipelines produce fresher dashboards and machine learning (ML) models, so users always have the most current information at their fingertips.
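As a taste of the approach, here is a minimal sketch of moving a pandas workload onto the GPU, assuming the RAPIDS cuDF library is installed and a GPU is available; the file name and column names are hypothetical:

```python
import pandas as pd
import cudf  # RAPIDS GPU DataFrame library; mirrors much of the pandas API

# CPU baseline: pandas reads and aggregates in host memory
cpu_df = pd.read_parquet("trips.parquet")  # hypothetical dataset
cpu_out = cpu_df.groupby("pickup_zone")["fare"].mean()

# GPU version: the same operations, executed in GPU memory.
# Columnar formats such as Parquet load far more efficiently than CSV.
gpu_df = cudf.read_parquet("trips.parquet")
gpu_out = gpu_df.groupby("pickup_zone")["fare"].mean()

print(gpu_out.sort_index().head())
```

Because cuDF follows the pandas API closely, existing pipelines can often be ported with little more than a changed import.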
-
Prerequisites
- Intermediate knowledge of Python (list comprehension, objects)
- Familiarity with pandas is a plus
- Introductory statistics (mean, median, mode)
-
Assessment Type
Skills-based coding assessments evaluate your ability to efficiently filter through millions of data points in the context of an interactive dashboard.
-
Certificate
Upon successful completion of the assessment, participants will receive an NVIDIA DLI certificate to recognize their subject matter competency and support professional career growth.
-
Hardware Requirements
You’ll need a desktop or laptop computer capable of running the latest version of Chrome or Firefox. You’ll be provided with dedicated access to a fully configured, GPU-accelerated workstation in the cloud.
Learning Objectives
-
01
How data moves within a computer, and how to strike the right balance among CPU, DRAM, disk storage, and GPUs.
-
02
How different file formats can be read and manipulated by hardware.
-
03
How to scale an ETL pipeline from one GPU to many using NVTabular (see the first sketch after this list).
-
04
How to build an interactive Plotly dashboard where users can filter millions of data points in less than a second (see the second sketch after this list).
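For objective 03, here is a minimal sketch of a multi-GPU ETL job, assuming NVTabular's operator-graph API together with Dask-CUDA for scale-out; the column names and file paths are hypothetical:

```python
import nvtabular as nvt
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Spin up one Dask worker per visible GPU; NVTabular uses the
# active Dask client to distribute work across them
cluster = LocalCUDACluster()
client = Client(cluster)

# Declare the ETL graph: categorical encoding plus continuous normalization
cats = ["user_id", "item_id"] >> nvt.ops.Categorify()  # hypothetical columns
conts = ["price", "age"] >> nvt.ops.Normalize()
workflow = nvt.Workflow(cats + conts)

# The Parquet files are partitioned across the GPU workers
dataset = nvt.Dataset("data/*.parquet", part_size="256MB")
workflow.fit(dataset)                                 # compute statistics
workflow.transform(dataset).to_parquet("processed/")  # write transformed output
```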
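And for objective 04, here is a minimal sketch of an interactive dashboard, assuming Plotly Dash on top of a cuDF DataFrame so filtering stays on the GPU; the dataset, column names, and slider range are hypothetical:

```python
import cudf
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

# Load once into GPU memory; callbacks then filter without touching disk
df = cudf.read_parquet("precipitation.parquet")  # hypothetical dataset

app = Dash(__name__)
app.layout = html.Div([
    dcc.Slider(id="min-precip", min=0, max=100, value=10),
    dcc.Graph(id="map"),
])

@app.callback(Output("map", "figure"), Input("min-precip", "value"))
def update_map(threshold):
    # GPU-side boolean filter over millions of rows, then convert only
    # the surviving points to pandas for plotting
    subset = df[df["precip_mm"] >= threshold].to_pandas()
    return px.scatter_mapbox(
        subset, lat="lat", lon="lon", color="precip_mm",
        mapbox_style="open-street-map", zoom=3,
    )

if __name__ == "__main__":
    app.run(debug=True)
```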
Workshop Outline
- Introduction (15 mins)
- Data on the Hardware Level (60 mins): Explore the strengths and weaknesses of different hardware approaches to data and the frameworks that support them.
- Break (15 mins)
- ETL with NVTabular (120 mins): Learn how to scale an ETL pipeline from one GPU to many with NVTabular, through the lens of a big data recommender system.
- Break (60 mins)
- Data Visualization (120 mins): Step into the shoes of a meteorologist and learn how to plot precipitation data on a map.
- Final Project: Data Detective (60 mins): Users are complaining that the dashboard is too slow. Apply the techniques learned in the workshop to find and fix the slowdown.
- Final Review (15 mins)