The Agenda

SQLBits 2024 runs from Tuesday 19th – Saturday 23rd March.
Data Integrator

Spark for Data Engineers

Description

Data analysts, data scientists, business intelligence analysts and many other roles require data on demand. Fighting with data silos, many scattered databases, Excel files, CSV files, JSON files, APIs and potentially different flavours of cloud storage can be tedious, nerve-wracking and time-consuming.

An automated process that follows a set of steps and procedures, takes subsets of data (columns from databases, binary files) and merges them together to serve business needs and potential is, and will remain, a favourite job for many organisations and teams.

Apache Spark™ is designed to build faster and more reliable data pipelines. It covers low-level and structured APIs and brings tools and packages for streaming data, machine learning, data engineering, building pipelines and extending the Spark ecosystem.

Spark is an absolute winner for these tasks and a great choice for adoption.

Data engineering should have the scope and capability to cover:

- System architecture
- Programming
- Database design and configuration
- Interface and sensor configuration

In addition, as important as familiarity with the technical tools is, the concepts of data architecture and pipeline design are even more important. The tools are worthless without a solid conceptual understanding of:

- Data models
- Relational and non-relational database design
- Information flow
- Query execution and optimisation
- Comparative analysis of data stores
- Logical operations

Apache Spark has the technology built in to cover these topics and the capacity to assemble functional systems that achieve a concrete goal.


Workshop Title: "Spark for Data Engineers"

Target Audience: Data engineers, BI engineers, cloud data engineers

Broader Audience: Analysts, BI analysts, big data analysts, DevOps data engineers, machine learning engineers, statisticians, data scientists, database administrators, data orchestrators, data architects

Prerequisite knowledge for attendees:
Data engineering tasks:
- analyzing and organizing raw data (with T-SQL or Python or R or Scala)
- building data transformations and pipelines (with T-SQL or Python or R or Scala)

Technical prerequisite for attendees:
- A working laptop with the ability to install Apache Spark and other tools
- Access to the internet
- Credentials and credit (free credit) for accessing the Azure portal


Agenda for the day (9 AM – 5 PM; start and end times can vary and will be finalised with the organiser)

1. Module 1 (9.00 – 10.00): Getting to know Apache Spark, installation and setting up the environment
2. Coffee break (15 min)
3. Module 2 (10.00 – 11.15): Creating Datasets, organising raw data and working with structured APIs
4. Coffee break (15 min)
5. Module 3 (11.30 – 13.00): Designing and building pipelines, moving data and building data models with Spark
6. Lunch (13.00 – 14.00)
7. Module 4 (14.00 – 15.00): Data and process orchestration, deployment and Spark Applications
8. Coffee break (15 min)
9. Module 5 (15.15 – 16.15): Data Streaming with
10. Module 6 (16.15 – 17.00): Ecosystem, tooling and community

All modules have hands-on material that will be given to attendees at the beginning of the training.

Feedback link: https://sqlb.it/?6188

Tech Covered

Azure, Spark, deployment, On Premise, Data Integrator, Managing Big Data, Managing, On Premises