[파이썬] Scrapy Rule과 LinkExtractor

06 Sep 2023

Scrapy

Scrapy is a powerful Python framework for web scraping. It allows you to efficiently extract data from websites by defining rules and using built-in classes such as Rule and LinkExtractor. In this blog post, we will explore how to use Scrapy’s Rule and LinkExtractor to scrape data from web pages.

Rule

The Rule class in Scrapy allows you to define a set of rules for following links and extracting data. It takes several parameters, including the link extractor, callback function, and other optional parameters.

Here’s an example of defining a Rule:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow='/page/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Code to extract data from the page

In the above example, we define a Rule that uses the LinkExtractor class to extract links that have “/page/” in their URL. The callback function “parse_page” is then called on each extracted link.

LinkExtractor

LinkExtractor is a class that allows you to define how links are extracted from web pages. It accepts various parameters to define the pattern for extracting links.

Let’s see an example of how to use LinkExtractor:

from scrapy.linkextractors import LinkExtractor

# Create a LinkExtractor object with specific parameters
link_extractor = LinkExtractor(allow='/page/', deny=('logout',))

# Extract links from a web page
links = link_extractor.extract_links(response)

In the above example, we create a LinkExtractor object with “allow” parameter set to ‘/page/’, which means it will only extract links that have ‘/page/’ in their URL. We also provide a “deny” parameter to exclude links containing ‘logout’ in their URL.

After creating the LinkExtractor object, we can use the extract_links() method to actually extract links from a web page. The response parameter is the Scrapy response object representing the web page to extract links from.

Conclusion

Scrapy’s Rule and LinkExtractor are powerful tools for defining scraping rules and extracting links from web pages. By using these classes, you can easily navigate through websites and extract the desired data for your web scraping projects.