I have a web scraper, written with beautifulsoup4, that scrapes Indeed for a list of GIS-related skills and counts how many jobs are associated with each skill. The results are returned as a JSON list.
Now this is very useful by itself, but what if I want to run it as a serverless function from anywhere instead of on my local machine? Maybe I could even have it run at set intervals and store the results in a database later.
Well, luckily there’s something called AWS Lambda that lets me do just that. Here’s how to port your Python functions to Lambda.
Let’s take a look at this code.
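Notice that I’ve called my function lambda_handler. This is important for portability with Lambda: a Python function created in the Lambda console uses lambda_function.lambda_handler as its default handler setting, so the Python file should be called lambda_function.py and the function lambda_handler (otherwise the handler setting has to be changed to match).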
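The listing itself isn’t reproduced here, so this is only a minimal sketch of the shape such a file could take. The SKILLS list and the scrape_job_count helper are hypothetical placeholders, not my actual scraper; in the real function that helper is where the requests and beautifulsoup4 logic would live.

```python
# lambda_function.py (sketch)

# Hypothetical skill list, for illustration only.
SKILLS = ["python", "sql", "arcgis"]


def scrape_job_count(skill):
    """Placeholder for the real scraping step.

    The real version would fetch a search page with requests and parse the
    job count with beautifulsoup4. It returns 0 here so the sketch runs
    without network access.
    """
    return 0


def lambda_handler(event, context):
    # Lambda passes the trigger payload as `event` and runtime details as
    # `context`. This scraper uses neither, so both are ignored.
    results = [{"skill": skill, "jobs": scrape_job_count(skill)} for skill in SKILLS]
    # A returned list of dicts is serialized to JSON by the Lambda runtime.
    return results
```

Calling lambda_handler({}, None) locally is a quick way to check the function before packaging it.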
This is the folder that I have my function in. I have it named indeed_scraper.py so I will need to rename it to lambda_function.py.
My function relies on packages that are not included in the standard library, namely beautifulsoup4 and requests. We will need to install these packages locally.
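We can do this by adding the -t (--target) flag to pip install, which installs the packages into a folder we choose, in this case the same folder as lambda_function.py.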
If the packages were successfully installed, your folder should look like this:
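Change directory into this folder and create a .zip of its contents, so that lambda_function.py and the package folders sit at the top level of the archive.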
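If you would rather script the install and zip steps than type them by hand, here is a rough sketch using only the standard library. The helper names and the archive name in the usage comments are my own assumptions; the first helper simply runs pip with its --target option, and the second zips the folder contents with paths relative to the folder.

```python
import subprocess
import sys
import zipfile
from pathlib import Path


def install_packages(target_dir, packages):
    # Same effect as running: pip install -t <target_dir> <packages...>
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--target", str(target_dir), *packages],
        check=True,
    )


def zip_folder_contents(folder, zip_path):
    # Store every file with a path relative to `folder`, so that
    # lambda_function.py ends up at the top level of the archive.
    folder = Path(folder)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            # Skip directories and the archive itself if it lives in `folder`.
            if path.is_file() and path.resolve() != zip_path.resolve():
                archive.write(path, path.relative_to(folder).as_posix())
    return zip_path


# Example usage (hypothetical archive name), run from inside the function folder:
# install_packages(".", ["beautifulsoup4", "requests"])
# zip_folder_contents(".", "../lambda_package.zip")
```

Zipping the contents rather than the folder itself matters, because Lambda looks for lambda_function.py at the root of the archive.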
Follow the steps outlined in an earlier tutorial to create a basic Lambda function.
Go to your Lambda function’s page and upload the .zip with your Python file and package contents.
Hit save to upload your files.
If it worked, your code should appear in the code window.
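I did the upload through the console, but the same step can be scripted with boto3. This is only a sketch: upload_zip is a hypothetical helper, and the function name and archive name in the usage comments are made up. The client is passed in as a parameter so the helper itself doesn’t need AWS credentials to be imported.

```python
from pathlib import Path


def upload_zip(lambda_client, function_name, zip_path):
    # update_function_code accepts the raw bytes of the archive via ZipFile.
    zip_bytes = Path(zip_path).read_bytes()
    return lambda_client.update_function_code(
        FunctionName=function_name,
        ZipFile=zip_bytes,
    )


# Example usage (needs boto3 and AWS credentials; names are hypothetical):
# import boto3
# upload_zip(boto3.client("lambda"), "indeed-scraper", "lambda_package.zip")
```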
My function takes about 60 seconds to run, so I will increase the function’s timeout from the default 3 seconds to 1 minute 30 seconds.
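The timeout can also be changed from code instead of the console. A small sketch, assuming a boto3 Lambda client is passed in; set_timeout and the function name in the usage comments are hypothetical.

```python
def set_timeout(lambda_client, function_name, seconds=90):
    # Timeout is given in whole seconds; 90 matches the 1 min 30 s used here.
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=seconds,
    )


# Example usage (needs boto3 and AWS credentials; the name is hypothetical):
# import boto3
# set_timeout(boto3.client("lambda"), "indeed-scraper")
```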
Save and hit the test button when you are ready. The test input doesn’t matter here because we are scraping from predefined website URLs and don’t use any input.
A successful web scraping! The output is returned as a JSON list, and we can easily store it with whatever AWS service we choose. Since we are using Python, we would use the boto3 package, the AWS SDK for Python, to achieve this.
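As one possible way to store the output, here is a sketch that writes the JSON list to S3. S3 is just my example choice of service; store_results, the bucket name, and the key prefix are all hypothetical. The S3 client is passed in so the helper can be tried out without AWS access.

```python
import json
from datetime import datetime, timezone


def store_results(s3_client, bucket, results, prefix="indeed-skills"):
    # Use a UTC timestamp in the key so each run gets its own object.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    key = f"{prefix}/{stamp}.json"
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(results).encode("utf-8"),
        ContentType="application/json",
    )
    return key


# Example usage inside lambda_handler (bucket name is hypothetical):
# import boto3
# store_results(boto3.client("s3"), "my-scraper-bucket", results)
```

The function’s execution role would also need permission to write to that bucket.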