[Python] requests-html: Extracting the Results of Script Execution on Web Pages

Web scraping sometimes runs into pages that use JavaScript to generate or modify their content dynamically. To get the complete data, we need the page as it looks after those scripts have run. In this blog post, we will explore how to use the requests-html library in Python to extract the results of JavaScript execution on web pages.

What is requests-html?

requests-html is a Python library for fetching web pages and parsing data out of them. It provides an interface similar to the popular requests library, with added features for handling dynamic content. To run JavaScript, it uses the pyppeteer library, which drives a headless Chromium browser.

Installing requests-html

Before starting, make sure you have the requests-html library installed. You can install it using pip:

$ pip install requests-html

Extracting JavaScript execution results

To extract JavaScript execution results using requests-html, we follow these steps:

  1. Import the necessary modules:
     from requests_html import HTMLSession
  2. Create an HTML session:
     session = HTMLSession()
  3. Fetch the web page content:
     response = session.get('https://example.com')
  4. Execute any pending JavaScript on the page:
     response.html.render()
  5. Extract the results:
     results = response.html.xpath('//script/text()')
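
Put together, the five steps might look like the sketch below. The names `fetch_script_texts` and `drop_blank` are illustrative and not part of requests-html, and the `timeout` value is an arbitrary choice. The import sits inside the function only so that the helper can be used where the library is not installed.

```python
def drop_blank(texts):
    """Strip whitespace and discard entries that are empty afterwards."""
    stripped = (text.strip() for text in texts)
    return [text for text in stripped if text]


def fetch_script_texts(url, timeout=20):
    """Render `url` and return the text of its inline <script> tags."""
    # Imported here so this module loads even where requests-html is absent.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # Runs the page's JavaScript in headless Chromium.
        response.html.render(timeout=timeout)
        # Scripts loaded via <script src=...> have no text and yield nothing.
        return drop_blank(response.html.xpath('//script/text()'))
    finally:
        # Also shuts down the headless browser started by render().
        session.close()
```

Calling `fetch_script_texts('https://example.com')` performs a real network request and starts a browser, so it is best tried in an interactive session first.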

In the above example, we first import the HTMLSession class from the requests_html module. Then, we create a session object and use it to fetch the web page content by calling the get() method with the URL of the page.

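After fetching the content, we call the render() method on the response.html object. It reloads the page in headless Chromium, runs the page's JavaScript, and replaces the content of response.html with the rendered HTML, so that later queries see the fully rendered page. The first call to render() downloads Chromium, so it can take noticeably longer than later calls.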
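
render() also accepts keyword arguments, among them `script`, `sleep` and `timeout`. When `script` is given, render() returns the value that the JavaScript function evaluates to. The sketch below assumes a session object is passed in; `evaluate_on_page`, `PAGE_INFO_SCRIPT` and the default values are illustrative choices, not part of the library.

```python
# JavaScript evaluated in the page once it has loaded; render() hands back
# whatever this function returns.
PAGE_INFO_SCRIPT = """
() => {
    return {
        title: document.title,
        linkCount: document.querySelectorAll('a').length,
    }
}
"""


def evaluate_on_page(session, url, script=PAGE_INFO_SCRIPT, sleep=1, timeout=20):
    """Fetch `url`, render it, and return the value computed by `script`."""
    response = session.get(url)
    # sleep: seconds to pause after the initial render
    # timeout: seconds to wait for the page to load before giving up
    return response.html.render(script=script, sleep=sleep, timeout=timeout)
```

With a real session this would be called as `evaluate_on_page(HTMLSession(), 'https://example.com')`. Passing the session in keeps the function easy to exercise with a stand-in object.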

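Finally, we call the xpath() method on the response.html object to extract the text content of all <script> tags on the rendered page. Keep in mind that this returns the source code of the inline scripts. Content that the scripts wrote into other elements is read by selecting those elements instead.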
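
For content that JavaScript writes into ordinary elements, a CSS selector on the rendered page is usually more direct. A minimal sketch follows; `rendered_text` and `demo` are hypothetical helpers, and the `h1` selector is only a placeholder for whichever element the target page fills in.

```python
def rendered_text(html, selector):
    """Return the text of the first element matching a CSS selector, or None."""
    # With first=True, find() returns a single element, or None if nothing matches.
    element = html.find(selector, first=True)
    return element.text if element is not None else None


def demo():
    """Render a page and print the text of one element (needs network access)."""
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get('https://example.com')
        response.html.render()
        print(rendered_text(response.html, 'h1'))
    finally:
        session.close()
```

`demo()` is deliberately not called here, since it opens a network connection and starts a browser.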

Conclusion

In this blog post, we have learned how to extract JavaScript execution results from web pages using the requests-html library in Python. By fetching a page, rendering it, and then querying the rendered HTML, we can access the dynamic content generated by JavaScript and gather complete data while scraping.