[Python] Handling Page Scrolling with requests-html

In web scraping, sometimes we need to simulate scrolling through a webpage to load more content dynamically. This is a common technique used to scrape data from websites that implement infinite scrolling or lazy loading.

In this blog post, we will explore how to use the requests-html library in Python to handle page scrolling and scrape the dynamically loaded content.

Installing the necessary libraries

Before we start, make sure you have the requests-html library installed. You can install it using pip:

pip install requests-html

Simulating page scrolling

To simulate page scrolling, we need to render the JavaScript content of the webpage. The requests-html library provides a convenient way to do this.

Let’s take a look at a simple example:

from requests_html import HTMLSession

# Create an HTML Session object
session = HTMLSession()

# Send a GET request to the webpage
response = session.get('https://example.com')

# Render the JavaScript content
response.html.render()

# Scrape the dynamically loaded content
data = response.html.xpath('//div[@class="content"]/p/text()')

# Print the scraped data
for item in data:
    print(item)

In the code snippet above, we create an HTMLSession object and send a GET request to the webpage. We then render the JavaScript content using the render() method, which loads the page in a headless Chromium browser and executes its scripts, so that any dynamically loaded content is available for scraping. Note that the first call to render() downloads a local copy of Chromium, which can take a few moments.
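
render() also accepts a few optional arguments that are worth knowing. Here is a quick sketch of the two you will reach for most often; the URL is the same placeholder as above:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://example.com')

# sleep: pause (in seconds) after rendering, giving slow
#        scripts extra time to finish loading content
# timeout: how long to wait for the page before giving up
response.html.render(sleep=1, timeout=20)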

Next, we use XPath to extract the desired data from the HTML tree. In this example, we extract the text from <p> elements that belong to a <div> with the class name “content”.

Finally, we print the scraped data to the console.
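
XPath is not the only option: requests-html also supports CSS selectors through the find() method. Below is a minimal sketch of the same extraction using find(); the URL and the selector are the same placeholders as in the example above:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://example.com')
response.html.render()

# find() takes a CSS selector and returns a list of Elements;
# each Element exposes its text content via .text
for paragraph in response.html.find('div.content p'):
    print(paragraph.text)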

Handling infinite scrolling

To scrape data from websites with infinite scrolling, we need the browser to scroll down the page multiple times while rendering. requests-html supports this directly: the render() method accepts a scrolldown argument, the number of times to scroll down, and a sleep argument, the number of seconds to pause after each scroll so new content has time to load.

Here’s an example that demonstrates infinite scrolling:

from requests_html import HTMLSession

# Create an HTML Session object
session = HTMLSession()

# Send a GET request to the webpage
response = session.get('https://example.com')

# Render the JavaScript content, scrolling down 3 times and
# pausing 1 second after each scroll so new content can load
response.html.render(scrolldown=3, sleep=1)

# Scrape the dynamically loaded content
data = response.html.xpath('//div[@class="content"]/p/text()')

# Print the scraped data
for item in data:
    print(item)

In this example, render() scrolls down the page three times (scrolldown=3) and pauses for one second after each scroll (sleep=1). The pause gives the site time to fetch and insert the newly loaded content, so it is present in the rendered HTML by the time we scrape it.
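
If you need finer control than the scrolldown counter offers, render() also accepts a script argument that executes arbitrary JavaScript in the headless browser and returns its result to Python. Here is a minimal sketch that scrolls to the bottom of the page with window.scrollTo; the URL is a placeholder, and you may need to repeat the scroll-and-sleep cycle for pages that load content in several stages:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://example.com')

# JavaScript executed inside the headless browser; render()
# returns the script's return value to Python
scroll_script = """
() => {
    window.scrollTo(0, document.body.scrollHeight);
    return document.body.scrollHeight;
}
"""

# sleep gives the page time to load new content after the scroll
height = response.html.render(script=scroll_script, sleep=2)
print('Page height after scrolling:', height)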

Conclusion

Using the requests-html library, we can easily simulate page scrolling and handle dynamic content while web scraping. By rendering the JavaScript content and scrolling down the page with render()'s scrolldown option, we can access and scrape the dynamically loaded data.

This technique is useful when dealing with websites that employ techniques like infinite scrolling or lazy loading. It allows us to extract all the information we need from the webpage.

I hope this blog post has been helpful in understanding how to handle page scrolling in Python using the requests-html library. Happy scraping!