[파이썬] Scrapy Request와 Response 객체

Scrapy is a powerful and flexible web scraping framework written in Python. It provides developers with tools to efficiently extract data from websites. In this blog post, we will explore Scrapy’s Request and Response objects and see how they can be used in web scraping projects.

Scrapy Request Object

The Request object is used to specify the URL(s) to be scraped and provide additional information such as headers, cookies, and meta data. It represents an HTTP request to be sent to a website.

To create a Scrapy Request object, you can use the following syntax:

from scrapy import Request

request = Request(url='http://www.example.com', callback=self.parse)

Here, we create a Request object by passing the URL and a callback function to handle the response. The callback function is where we define how to process the web page once the request is completed.

Additionally, we can specify other parameters such as headers, form data, cookies, and more. For example:

request = Request(
    url='http://www.example.com',
    callback=self.parse,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'},
    cookies={'key1': 'value1', 'key2': 'value2'},
    meta={'category': 'books'}
)

In the above example, we add custom headers to mimic a web browser, set cookies, and include meta data that can be accessed in the callback function.

Scrapy Response Object

The Response object represents the web page retrieved as a result of sending a Request. It contains the HTML content, response headers, status codes, and more.

After sending a request using yield request, the parse method is invoked with the Response object. The Response object can be used to extract data using XPath or CSS selectors, handle pagination, follow links, and more.

Here’s an example of how to extract data using Scrapy’s Response object:

def parse(self, response):
    title = response.css('h1::text').get()
    description = response.xpath('//div[@class="description"]/text()').get()
    yield {
        'title': title,
        'description': description
    }

In the above code snippet, we extract the title and description using CSS and XPath selectors from the web page represented by the Response object. We then yield a Python dictionary containing the extracted data.

Conclusion

The Scrapy Request and Response objects are essential components when performing web scraping with Scrapy. They allow you to send HTTP requests, specify additional parameters, and handle the response. By using these objects effectively, you can build robust and efficient web scraping pipelines.

Scrapy’s extensive documentation provides further details on how to utilize these objects and other advanced features. Happy scraping!