[Python] Complying with robots.txt using requests-html

Welcome to another tech blog post, where we will learn how to comply with the robots.txt file using the requests-html library in Python. The robots.txt file plays a crucial role in informing web crawlers which parts of a website they are allowed to access. To respect the guidelines set by a website’s administrators, a web scraping bot should follow the rules defined in that file.

What is robots.txt?

The robots.txt file is a text file placed at the root directory of a website that specifies the rules for web crawlers or robots. It is used to control the behavior of web crawlers, informing them which pages or directories are allowed or disallowed for crawling, and setting other directives such as crawl delays.
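
As an illustration, a hypothetical robots.txt might look like the following: the first group applies to all crawlers, blocks two directories, and requests a pause between fetches, while the second group shuts out one specific bot entirely.

User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: some-specific-bot
Disallow: /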

Why comply with robots.txt?

Respecting the rules outlined in the robots.txt file is not only an ethical practice, but it also ensures the smooth functioning of web scraping activities. Failing to comply with the rules may lead to penalties, IP blocking, or legal consequences.

Using requests-html library for robots.txt compliance

To comply with the robots.txt rules, we can use the requests-html library in Python. This library provides a simple and intuitive way to make HTTP requests and parse the response HTML. requests-html does not check robots.txt on its own, but we can use the same session to fetch the file and verify the permissions before making any further requests.
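
For reference, a minimal fetch-and-parse with requests-html looks like the snippet below; example.com is only a placeholder URL.

from requests_html import HTMLSession

session = HTMLSession()

# Fetch a page and list every link target found via a CSS selector
r = session.get("https://www.example.com")
for link in r.html.find("a"):
    print(link.attrs.get("href"))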

Here’s an example of how to extract information from a webpage while respecting the robots.txt rules:

from urllib.parse import urlsplit

from requests_html import HTMLSession

session = HTMLSession()
USER_AGENT = "my-web-scraper"

def is_allowed(url, user_agent=USER_AGENT):
    # robots.txt always lives at the root of the host, not under the page path
    parts = urlsplit(url)
    rules_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    response = session.get(rules_url)

    # A missing robots.txt means crawling is not restricted
    if response.status_code == 404:
        return True

    # Collect the Disallow paths from the groups that apply to us:
    # the wildcard group ("*") and our own user-agent group, if any
    disallowed_paths = []
    applies_to_us = False

    for raw_line in response.text.splitlines():
        line = raw_line.split("#")[0].strip()  # drop comments and whitespace
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()

        if key == "user-agent":
            applies_to_us = value == "*" or value.lower() == user_agent.lower()
        elif applies_to_us and key == "disallow" and value:
            disallowed_paths.append(value)

    # The URL is blocked if its path falls under any disallowed prefix
    path = parts.path or "/"
    return not any(path.startswith(rule) for rule in disallowed_paths)

if is_allowed("https://www.example.com/some-page.html"):
    page = session.get("https://www.example.com/some-page.html")

    # Process the HTML page
    # ...

In the above code, we first define a function is_allowed(url) that checks whether the given URL is allowed to be scraped based on the rules defined in the robots.txt file. If the URL is allowed, we make an HTTP request using the session.get() function and proceed with processing the HTML page.

The function builds the robots.txt URL from the scheme and host of the target URL and retrieves its contents. If the file is not found (404 status code), we treat crawling as allowed. Otherwise, we walk through the file line by line, splitting each line into a directive and a value, and collect the Disallow paths from the groups that apply to us: the wildcard group and our own user_agent group. The request is allowed only if the URL’s path does not start with any of those disallowed prefixes. Note that this is a deliberately simplified parser: it ignores Allow lines, Crawl-delay, and wildcard characters inside rules.
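
Since the check is keyed to the my-web-scraper agent string, it is consistent to have the actual page requests identify themselves with the same value. HTMLSession subclasses requests.Session, so the standard headers keyword argument applies; a small sketch using the placeholder URL from above:

page = session.get(
    "https://www.example.com/some-page.html",
    headers={"User-Agent": "my-web-scraper"},  # identify the scraper consistently
)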

By using this approach, we can ensure our web scraping activities are aligned with the rules set by website administrators.
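
As an aside, Python’s standard library ships urllib.robotparser, which performs this check without hand-rolled parsing and also exposes Crawl-delay values. A minimal sketch combining it with requests-html, using the same placeholder URL and user agent as above:

from urllib import robotparser

from requests_html import HTMLSession

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

session = HTMLSession()
url = "https://www.example.com/some-page.html"

# can_fetch() applies the parsed rules for our user agent to the URL
if rp.can_fetch("my-web-scraper", url):
    page = session.get(url)
    # Process the HTML page
    # ...

If the site sets a crawl delay, rp.crawl_delay("my-web-scraper") returns it (or None when no delay is specified), which can be used to pause between requests.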

Conclusion

In this blog post, we learned about complying with the robots.txt rules using the requests-html library in Python. It is essential to respect these rules to maintain good ethical practices and avoid any penalties or legal consequences. By incorporating the code provided, you can easily implement this compliance mechanism in your web scraping projects.

Remember, web scraping should always be done responsibly, within the bounds of the website’s terms and conditions, and in compliance with applicable laws and regulations.