Let’s see how it is possible to use Chromium and Selenium in an AWS Lambda function; first, some information for those unfamiliar with these two projects.
Chromium is the open source browser from which Google Chrome derives. Browsers share most of the code and functionality. However, they differ in terms of license and Chromium does not support Flash, does not have an automatic update system and does not collect usage and crash statistics. We will see that these differences do not affect the potential of the project in the least and that, thanks to Chromium, we can perform many interesting tasks within our Lambda functions.
Selenium is a well-known framework dedicated to testing web applications. We are particularly interested in Selenium WebDriver, a tool that allows to control all the main browsers available today, including Chrome/Chromium.
Why use Chrome in a Lambda function?
What is the purpose of using a browser in an environment (AWS Lambda) that does not have a GUI? In reality there are several. Operating a browser allows you to automate a series of tasks. It is possible, for example, to test your web application automatically by creating a CI pipeline. Or, not least, web scraping.
The web scraping is a technique that allows to extract data from a website and its possible applications are infinite: monitor product prices, check the availability of a service, build databases by acquiring records from multiple open sources.
In this post we will see how to use Chromium and Selenium in a Lambda function Python to render an URL.
Example lambda function
The Lambda Function we are going to create has a specific purpose: given a URL, Chromium is used to render the related web page and a screenshot of the content is captured in PNG format. We are going to save the image in an S3 bucket. By running the function periodically, we will be able to “historicize” the changes of any site, for example the homepage of an online information newspaper.
Obviously our Lambda has many possibilities for improvement: the aim is to show the potential of this approach.
Let’s start with the required Python packages (requirements.txt).
Packages are not normally available in the AWS Lambda Python environment: we need to create a zipfile to distribute our function including all dependencies.
We get Chromium, in its Headless version, that is able to run in server environments. We need the relevant Selenium driver. Here are the links I used.
How to use Selenium Webdriver
A simple example of how to use Selenium in Python: the get method allows to direct the browser to the indicated URL, while the save_screenshot method allows to save a PNG image of the content. The image will have the size of the Chromium window, which is set using the window-size argument to 1280×1024 pixels.
I decided to create a wrapper class to render a page in all its height . We need to calculate the size of the window required to contain all the elements. We use a trick, a script that we will execute after page loading. Again Webdriver exposes an execute_script method that is right for us.
The solution is not very elegant but functional: the requested URL must be loaded twice. First time browser window is set to a fixed size. After page loading, a JS script is used to determine the size of the required window. After that, original browser window is closed and a new one, correctly sized, is opened to get the full height screenshot. The JS script used was taken from this interesting post .
Wrapper is also “Lambda ready”: it correctly manages the temporary paths and the position of the headless-chromium executable specifying the necessary chrome_options.
Let’s see what our Lambda function is like:
The event handler of the function simply takes care of instantiating our WedDriverScreenshot object and uses it to generate two screenshots: the first with a fixed window size (1280×1024 pixels). For the second, the Height parameter is omitted, which will be automatically determined by our wrapper.
Here are the two resulting images, compared with each other, relating to the site www.repubblica.it
Deployment of the function
I collected in this GitHub repository all files needed to deploy the project on AWS Cloud. In addition to the files already analyzed previously, there is a CloudFormation template for stack deployment. The most important section obviously concerns the definition of the ScreenshotFunction : some environment variables such as PATH and PYTHONPATH are fundamental for the correct execution of the function and Chromium. It is also required to pay attention to the memory requirements and the timeout setting: loading a page can take several seconds. The lib path includes some libraries required for the execution of Chromium and not present by default in the Lambda environment.
As usual I prefer to use a Makefile for the execution of the main operations.
## download chromedriver, headless-chrome to `./bin/` make fetch-dependencies ## prepares build.zip archive for AWS Lambda deploy make lambda-build ## create CloudFormation stack with lambda function and role. make BUCKET=your_bucket_name create-stack
The CloudFormation template requires an S3 bucket to be used as a source for the Lambda function deployment. Subsequently the same bucket will be used to store the PNG screenshots.
We used Selenium and Chromium to render a web page. It is a possible application of these two projects in the serverless field. As anticipated, another very interesting application is web scraping. In this case, Webdriver’s page_source attribute allows access to the sources of the page and a package such as BeautifulSoup can be very useful for extracting the data we intend to collect. The Selenium package for Python offers several methods to automate other operations: please check the excellent documentation.
We will see an example of web scraping in one of the next blog posts! At the moment, have a look of this post GitHub repository.
Did we have fun? See you next time!