[파이썬] requests-html 웹사이트의 트렌드 파악

Python has become one of the most popular programming languages for web scraping and data analysis tasks. With its easy-to-use libraries and powerful capabilities, Python allows us to gather valuable insights from websites effortlessly.

One popular library for web scraping in Python is requests-html. It is a package that combines the flexibility of the Requests library with the power of HTML parsing via Pyppeteer.

In this blog post, we will explore how we can use requests-html to analyze the trends of the requests-html website itself. We will focus on fetching information such as the latest releases, top contributors, and recent issues.

Installation

Before we dive into the code, let’s make sure we have requests-html installed. You can install it using pip:

pip install requests-html

Fetching the Latest Releases

One way to understand the trends of any project is by looking at its latest releases. With requests-html, we can easily fetch the latest releases from the requests-html GitHub repository.

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://github.com/psf/requests-html/releases'
response = session.get(url)

latest_releases = []
release_elements = response.html.find('.release-entry')
for release in release_elements:
    version = release.find('.release-meta .release-title', first=True).text
    release_date = release.find('.release-meta .release-date', first=True).text
    latest_releases.append({
        'version': version,
        'release_date': release_date
    })

print(latest_releases)

In the above code, we first create an HTMLSession object and make a GET request to the GitHub repository’s releases page. We then extract the release version and date from each release element using CSS selectors. Finally, we store this information in a list of dictionaries.

Finding the Top Contributors

The top contributors of a project can give us insights into a project’s popularity and activity. Let’s use requests-html to identify the top contributors of the requests-html GitHub repository.

top_contributors = []
contributor_elements = response.html.find('.repo-contributor-list .contrib-data')
for contributor in contributor_elements:
    username = contributor.find('.contrib-username', first=True).text
    contributions = contributor.find('.contrib-number', first=True).text
    top_contributors.append({
        'username': username,
        'contributions': contributions
    })

print(top_contributors)

In the code snippet above, we find the contributor list on the GitHub repository page using CSS selectors. We then extract each contributor’s username and number of contributions. Finally, we store this information in a list of dictionaries.

Analyzing Recent Issues

Analyzing recent issues is another effective way to identify the trends in a project. Let’s retrieve the information about the most recent issues in the requests-html GitHub repository using requests-html.

recent_issues = []
issue_elements = response.html.find('.js-issue-row')
for issue in issue_elements:
    title = issue.find('.h4', first=True).text
    labels = [label.text for label in issue.find('.labels > .label')]
    recent_issues.append({
        'title': title,
        'labels': labels
    })

print(recent_issues)

In the above code, we find the issue list on the GitHub repository page using CSS selectors. We then extract the issue title and labels associated with each issue. Finally, we store this information in a list of dictionaries.

Conclusion

In this blog post, we have explored how to use the requests-html library in Python to analyze the trends of the requests-html website itself. We have seen examples of fetching the latest releases, finding the top contributors, and analyzing recent issues.

By leveraging the power of requests-html, we can gather valuable insights from any website and make informed decisions. Give it a try and unleash the potential of web scraping and data analysis in your Python projects!