How to Trigger Thousands of Cloud Functions Using Pub/Sub & Databricks: A Comprehensive Guide

Pranit Sherkar
Feb 23, 2024

In today’s fast-paced digital environment, handling thousands, if not millions, of tasks efficiently is a common challenge for developers and engineers. Recently, I set out to make thousands of HTTP requests from Python to a website to fetch data for analysis. Through trial and error, I arrived at a highly effective method: Google Cloud Functions triggered by Pub/Sub messages. In this article, I’ll walk you through the evolution of our approach, including what worked, what didn’t, and how Pub/Sub changed the game for us.

The Initial Approach: Loops and API Calls in Databricks

Our journey began with a seemingly straightforward strategy: using Databricks to loop through tasks and make API calls iteratively. This method, while simple to implement, quickly proved inefficient for our needs. Here’s a basic example of how this looked in code:

import requests

def fetch_data(url):
    response = requests.get(url)
    return response.json()

urls = ["http://example.com/api/data?id={}".format(i) for i in range(1000)]

for url in urls:
    data = fetch_data(url)
    # Process data

Although this approach is easy to understand, it struggles with scalability and efficiency, making it unsuitable for our requirements.

The Second Attempt: Spark Partitioning and Thread Pooling

To overcome the limitations of the initial method, we explored two parallelization strategies: Spark partitioning and using a ThreadPoolExecutor. Both methods aimed to utilize concurrent processing to speed up the task.

Spark Partitioning

Leveraging Spark’s distributed computing capabilities, we attempted to partition the work across multiple nodes. This approach improved performance but still fell short of our expectations.

import requests
from pyspark.sql import SparkSession

def fetch_data_partition(partition):
    # `partition` is an iterator of URLs assigned to this slice
    results = []
    for url in partition:
        response = requests.get(url)
        results.append(response.json())
    return results

spark = SparkSession.builder.appName("DataFetch").getOrCreate()
urls = ["http://example.com/api/data?id={}".format(i) for i in range(1000)]
urls_rdd = spark.sparkContext.parallelize(urls, numSlices=100)
results = urls_rdd.mapPartitions(fetch_data_partition).collect()

ThreadPoolExecutor

Similarly, we experimented with Python’s concurrent.futures.ThreadPoolExecutor for parallel processing, which offered a more straightforward implementation than Spark but still didn't achieve the desired scalability.

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch_data(url):
    response = requests.get(url)
    return response.json()

urls = ["http://example.com/api/data?id={}".format(i) for i in range(1000)]

with ThreadPoolExecutor(max_workers=50) as executor:
    future_to_url = {executor.submit(fetch_data, url): url for url in urls}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        data = future.result()
        # Process data

Transitioning to Google Cloud Functions and HTTPS Calls

Realizing the limitations of our previous methods, we pivoted to Google Cloud Functions, triggering them via HTTPS calls. This approach worked at first and seemed promising, since it let us offload execution to a serverless environment. However, as we scaled toward a million requests, we ran into errors caused by synchronous calls and network limitations.
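
For reference, the intermediate setup looked roughly like the sketch below. It assumes an HTTP-triggered function reachable at an illustrative URL; the payload shape and timeout are assumptions, not our exact configuration.

import requests

FUNCTION_URL = "https://us-central1-my-project.cloudfunctions.net/fetch-data"

for i in range(1000000):
    # Each call blocks until the function returns, so the Databricks driver
    # remains the bottleneck; this is where errors appeared at scale.
    response = requests.post(FUNCTION_URL, json={"id": i}, timeout=60)
    # Process response.json() here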

The Game-Changer: Pub/Sub Triggers for Cloud Functions

The breakthrough came when we adopted Google Cloud Pub/Sub as a trigger for Cloud Functions. This method allowed us to decouple the task submission from the execution, significantly improving scalability and performance. We set up 100 instances of Cloud Functions with a concurrency level of 100, enabling us to process 2.5 million requests in under 10 hours.

Here’s a simplified version of how we set up a Cloud Function triggered by Pub/Sub:

1. Create a Pub/Sub Topic:

First, we created a Pub/Sub topic that would serve as the event source for triggering our Cloud Functions.

gcloud pubsub topics create my-topic

2. Deploy the Cloud Function:

Next, we deployed our Cloud Function, specifying the Pub/Sub topic as the trigger. The function is designed to process messages published to the topic; a sketch of what the function body can look like follows the deploy command.

gcloud functions deploy my-function --runtime python39 --trigger-topic my-topic --memory 128MB --timeout 540s --max-instances 100
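
Here is a minimal sketch of the function code itself. It assumes a first-generation, Pub/Sub-triggered function on the Python 3.9 runtime deployed with --entry-point handle_task and a requirements.txt listing requests; the URL template and processing logic are illustrative, not our exact implementation.

import base64
import requests

def handle_task(event, context):
    # Pub/Sub delivers the message payload base64-encoded in event["data"]
    task = base64.b64decode(event["data"]).decode("utf-8")  # e.g. "Task 42"
    task_id = task.split()[-1]

    # Illustrative work: fetch the record this task refers to
    response = requests.get(
        "http://example.com/api/data?id={}".format(task_id), timeout=30
    )
    response.raise_for_status()
    # Process or persist response.json() here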

3. Publish Messages to the Topic:

To trigger the function, we published messages (each representing a task or request) to the Pub/Sub topic. This could be done programmatically through the Pub/Sub API.

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

def publish_messages():
    for i in range(2500000):
        # The message payload must be a byte string
        data = f"Task {i}".encode("utf-8")
        future = publisher.publish(topic_path, data)
        print(future.result())

publish_messages()
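
One caveat with the snippet above: calling future.result() after every publish blocks the loop on each message. A less blocking variant could look like the sketch below, assuming google-cloud-pubsub 2.x, whose publish futures are compatible with concurrent.futures.wait.

from concurrent import futures
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

publish_futures = []
for i in range(2500000):
    data = f"Task {i}".encode("utf-8")
    publish_futures.append(publisher.publish(topic_path, data))
    # Drain periodically so millions of pending futures don't pile up in memory
    if len(publish_futures) >= 10000:
        futures.wait(publish_futures)
        publish_futures.clear()

futures.wait(publish_futures)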

Conclusion

Through experimentation and adaptation, we successfully scaled our data fetching operation to handle millions of requests efficiently. The key to our success was leveraging Google Cloud's Pub/Sub system to trigger Cloud Functions, enabling unparalleled scalability and performance. This journey underscored the importance of selecting the right tools and approaches for the task at hand, especially when working with cloud-native technologies. Whether you're dealing with data processing, web scraping, or any task requiring massive parallel processing, consider the power of Pub/Sub and Cloud Functions to elevate your project's efficiency and scalability.
