Apache Spark is a distributed processing technology. In this article, we explain what it is, cover the key concepts to keep in mind, and provide guidance to help you start using it easily.
What is Apache Spark?
Apache Spark is a distributed processing engine, which allows it to scale and process virtually unlimited volumes of data in a cost-effective manner.
Distributed systems are groups of computers or servers that work together in a coordinated way, as if they were a single entity. We can think of them as an orchestra, where different instruments play in harmony as a whole.
In this article, we delve into how distributed systems work in detail.
Key concepts to understand how Apache Spark works
Before diving deeper into the topic, it’s essential to know the key concepts to understand and work with Apache Spark:
- Driver: Responsible for coordinating all the work. It’s important to note that the driver doesn’t handle the data directly; it only provides instructions and then retrieves results.
- Executors: These are where the work happens. They are the virtual machines on which the code runs and where all data operations take place.
- Cluster: A set of virtual machines that consists of a driver and several executors. Technically, it is composed of nodes.
- Nodes: Physical or virtual machines that, together, form the cluster. Within the nodes are slots, which represent the computing portion of each executor node. Slots are responsible for executing tasks assigned by the driver and are the smallest unit available for parallelization. A node can contain multiple slots, and their number can be configured based on requirements.
- Resources: RAM, disk, and network are always shared among all the slots on a node.
- Parallelization: The way tasks are executed. Instead of processing them sequentially, tasks run simultaneously across different nodes to speed up processing (see the short sketch after this list).
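To make these concepts a little more tangible, here is a minimal PySpark sketch (the application name is just a placeholder) that starts a local session and prints the parallelism available to it. In local mode, the driver and the slots all live on your own machine.

```python
# Minimal sketch, assuming the pyspark package is installed.
from pyspark.sql import SparkSession

# The driver is the process running this script; "local[*]" simulates the
# executors' slots with one thread per available CPU core.
spark = (
    SparkSession.builder
    .appName("key-concepts-demo")
    .master("local[*]")
    .getOrCreate()
)

# defaultParallelism is roughly the number of slots available to this
# application, i.e. how many tasks Spark will try to run at the same time.
print("Default parallelism:", spark.sparkContext.defaultParallelism)

spark.stop()
```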
How Does Apache Spark Work?
Apache Spark allows tasks to be broken down into multiple parts. Let’s explain this with an example to make it easier to understand.
Imagine a classroom with a teacher and several students. The teacher has a bag with multiple packets of M&M candies in different colors. The teacher gives each student a packet and asks them to separate the blue ones.
Now, let’s translate this into the world of data. We start with a dataset (the M&Ms). The driver (the teacher) defines how the data will be organized and divides it into partitions. Once the partitions are created, it assigns tasks to each one (the students). Remember, the driver doesn’t handle the data directly.
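As a rough translation of the analogy into code (the candy data below is invented purely for illustration), the following PySpark sketch creates a tiny dataset, lets the driver split it into partitions, and filters each partition for the blue candies. The filtering itself runs in the executors, not in the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mm-example").master("local[*]").getOrCreate()

# The "bag of M&Ms": a small illustrative dataset of candies and their colors.
candies = spark.createDataFrame(
    [(i, color) for i, color in enumerate(["blue", "red", "green", "blue", "yellow", "blue"])],
    ["id", "color"],
)

# The driver splits the data into partitions (the packets handed to the students)...
packets = candies.repartition(3)

# ...and each executor slot filters its own partition for the blue candies.
blue_only = packets.filter(packets.color == "blue")

print("Blue candies found:", blue_only.count())

spark.stop()
```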
How Apache Spark Works (Driver, Cluster, Executors)
At this stage, a new concept comes into play: jobs in an Apache Spark application. A job is the instruction we provide, such as filtering. The application is the code (written in SQL, Python, Scala, R, etc.), and the driver determines how many jobs are needed to execute that application.
Each job is divided into stages, and each stage into tasks. When the instructions are complex, they are broken down into a larger number of tasks, which are grouped into the stages that make up the job.
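A small sketch can make the job, stage, and task split more concrete. In the hypothetical example below, the filter and groupBy only describe the work; the job is launched when the action collect() is called, and because groupBy needs a shuffle, Spark splits the job into more than one stage, each made up of one task per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jobs-and-stages").master("local[*]").getOrCreate()

df = spark.range(1_000_000)  # a simple numeric dataset with a single "id" column

# Transformations like filter() and groupBy() are lazy: they only describe the work.
evens_per_bucket = (
    df.filter(df.id % 2 == 0)
      .groupBy((df.id % 10).alias("bucket"))
      .count()
)

# The action collect() is what makes the driver launch a job. Because groupBy()
# requires a shuffle, the job is split into more than one stage, and each stage
# into one task per partition.
print(evens_per_bucket.collect()[:3])

spark.stop()
```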
Complex tasks are split into smaller blocks to solve them step by step, as illustrated in the following image:
Execution in Apache Spark
If we need to increase or decrease parallelism, we can add (or remove) more executors (machines).
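In practice, this is usually requested through executor settings when building the session or submitting the application. The sketch below is only illustrative: the numbers are placeholders, and whether they are honored depends on the cluster manager (in the local mode used here so the snippet runs on a laptop, they are not enforced).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scaling-example")
    .master("local[*]")                        # replace with your cluster's master URL
    .config("spark.executor.instances", "4")   # request 4 executors
    .config("spark.executor.cores", "2")       # 2 slots (cores) per executor
    .getOrCreate()
)

# Confirm the setting was registered in the Spark configuration.
print(spark.sparkContext.getConf().get("spark.executor.instances"))

spark.stop()
```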
Examples of Applications That Use Apache Spark Today
Every day, we use applications whose algorithms run on Apache Spark. Some of the most common examples include:
- Netflix’s recommendation systems for series or movies.
- Spotify’s Discover Weekly feature.
- Airbnb’s suggestions for “best prices” in a specific location we might be interested in visiting.
How to Start Using Apache Spark
We know that getting started with Apache Spark can be challenging; however, there are much simpler ways to do so today.
The easiest way, and the one we recommend for beginners, is to open a free account on Databricks Community Edition, which gives you access to the Databricks front end. This platform has simplified many services that used to be complex.
Once there, you can create a small cluster to run your code on. Keep in mind that the free version has certain limitations, but it is more than enough for practice.
Databricks is a Platform as a Service (PaaS), making it much easier to implement Apache Spark. Additionally, it runs on the three major cloud providers in the market (Azure, AWS, and GCP). There are many ways to use it, depending on the scenario you’re working with. The good news is that the cloud offers options to suit all needs.
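To give an idea of what a first notebook cell might look like, here is a small sketch with invented data. In a Databricks notebook, the SparkSession is already available as spark, so there is no setup boilerplate.

```python
# In a Databricks notebook, `spark` is already defined; just build a small
# DataFrame and query it. (The names and ages here are invented for illustration.)
people = spark.createDataFrame(
    [("Ana", 34), ("Luis", 28), ("Marta", 41)],
    ["name", "age"],
)

people.filter(people.age > 30).show()
```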
How to Learn More About Apache Spark
We recommend two approaches, depending on the person’s profile and the level of technical depth they need to achieve:
- Use the Databricks Community Edition: As explained above, this allows you to set up a small cluster to start practicing. From this perspective, you’ll learn to use Apache Spark with a 100% data engineering focus. The emphasis is on understanding what needs to be done and how to do it, avoiding the complexities associated with the underlying architecture.
- Apache Spark Standalone: If you’re more interested in the technical details, you can download Apache Spark and set up a standalone installation. This means installing it on your own machine and running everything in your local environment. Keep in mind that this option is somewhat more complex (a minimal local-mode sketch follows this list).
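Here is a minimal local-mode sketch, assuming PySpark has been installed (for example with pip install pyspark). The string "local[*]" tells Spark to run the driver and the executor slots as threads on your own machine.

```python
from pyspark.sql import SparkSession

# Local playground: everything (driver and slots) runs on this machine.
spark = (
    SparkSession.builder
    .appName("local-playground")
    .master("local[*]")
    .getOrCreate()
)

df = spark.range(10)  # a tiny DataFrame just to confirm the setup works
df.show()

spark.stop()
```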
Conclusion
Apache Spark is not only a powerful tool for distributed data processing but also accessible to both beginners and experts.
Thanks to its ability to scale and divide tasks in parallel, it enables efficient and cost-effective management of large volumes of information. Whether through intuitive platforms like Databricks or more technical standalone installations, Apache Spark offers flexibility to adapt to different needs and experience levels.
In a data-driven world, learning to use Apache Spark can be a crucial step toward optimizing projects and tackling complex challenges in the field of Data & AI.
–
The original version of this article was written in Spanish and translated into English with ChatGPT.