In web scraping, we sometimes encounter pages that use JavaScript to generate or modify content dynamically. To collect complete data, we need the page as it looks after that JavaScript has run. In this blog post, we will explore how to use the requests-html library in Python to extract the results of JavaScript execution on web pages.
What is requests-html?
requests-html is a Python library that makes it easy to fetch and parse web pages. It provides an interface similar to the popular requests library, with added features for handling dynamic content. To execute JavaScript, it uses the pyppeteer library to drive a headless Chromium browser.
Installing requests-html
Before starting, make sure the requests-html library is installed. You can install it using pip:
$ pip install requests-html
Extracting JavaScript execution results
To extract JavaScript execution results using requests-html, we follow these steps:
- Import the necessary modules:
from requests_html import HTMLSession
- Create an HTML session:
session = HTMLSession()
- Fetch the web page content:
response = session.get('https://example.com')
- Execute any pending JavaScript on the page:
response.html.render()
- Extract the results:
results = response.html.xpath('//script/text()')
In the above example, we first import the HTMLSession class from the requests_html module. Then, we create a session object and use it to fetch the web page content by calling the get() method with the URL of the page.
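After fetching the content, we call the render() method on the response.html object. This loads the page in a headless Chromium browser, runs its JavaScript, and replaces the stored HTML with the rendered version, giving us access to the fully rendered page. The first call to render() downloads Chromium, so it can take noticeably longer than later calls.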
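render() also accepts optional arguments, including script (JavaScript to evaluate in the page, whose return value render() passes back to Python), sleep (seconds to wait after the initial render), and timeout. The following is a minimal sketch of using them; the helper name render_with_script and the script itself are assumptions for illustration, not part of requests-html.

```python
# JavaScript evaluated in the page after it loads; render() hands its
# return value back to Python. The script itself is only illustrative.
TITLE_AND_LINK_COUNT = """
() => {
    return {
        title: document.title,
        links: document.querySelectorAll('a').length,
    }
}
"""


def render_with_script(response, script=TITLE_AND_LINK_COUNT, sleep=1, timeout=20):
    """Render the page and return the value produced by `script`.

    `response` is expected to be the object returned by HTMLSession.get().
    render_with_script is a hypothetical helper, not part of requests-html.
    """
    # sleep: seconds to wait after the initial render, for slow scripts.
    # timeout: seconds before the page load is abandoned.
    return response.html.render(script=script, sleep=sleep, timeout=timeout)
```

Given a response obtained with session.get(), calling render_with_script(response) should return a dictionary with the keys title and links, as computed inside the browser.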
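Finally, we query the rendered page by using the xpath() method on the response.html object. In this example, we extract the text content of all <script> tags on the page. Note that this returns the source code of the scripts, not their output. To read the content the scripts produced, select the elements they populate instead, for example with response.html.find() and a CSS selector.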
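As a sketch of reading content that JavaScript wrote into the page, the steps above can be wrapped in a helper that queries rendered elements by CSS selector. The helper name extract_rendered_text and any selector passed to it are assumptions for illustration; get(), render(), and find() are the requests-html calls doing the work.

```python
def extract_rendered_text(session, url, selector):
    """Return the text of every element matching `selector` after rendering.

    `session` is expected to behave like requests_html.HTMLSession.
    extract_rendered_text is a hypothetical helper name.
    """
    response = session.get(url)
    # Run the page's JavaScript in headless Chromium and replace
    # response.html with the rendered markup.
    response.html.render()
    # find() takes a CSS selector and returns a list of Element objects.
    return [element.text for element in response.html.find(selector)]
```

With session = HTMLSession(), a call such as extract_rendered_text(session, 'https://example.com', 'h1') would return the text of the page's h1 elements as they appear after rendering. Passing the session in as an argument also makes it possible to reuse one session across several pages.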
Conclusion
In this blog post, we have learned how to extract JavaScript execution results from web pages using the requests-html library in Python. By following the steps above, we can work with the fully rendered page and access the dynamic content generated by JavaScript, which lets us gather complete data while web scraping.