[Python] Extracting Metadata with requests-html

In web scraping, extracting metadata from a web page can provide valuable information about the page’s content, authorship, and other relevant details. With the help of the requests-html library in Python, we can easily extract metadata from HTML documents.

Installing requests-html

Before diving into the code, make sure you have the requests-html library installed. You can install it using pip:

pip install requests-html

Extracting Metadata using requests-html

To extract metadata from an HTML document using requests-html, follow these steps:

  1. Import the HTMLSession class:
from requests_html import HTMLSession
  2. Create an instance of the HTMLSession class:
session = HTMLSession()
  3. Make a GET request to the desired URL:
url = "https://example.com"  # Replace with the actual URL
response = session.get(url)
  4. Extract the metadata from the HTML document:
metadata = response.html.xpath('//meta')

Here, the XPath expression //meta selects every <meta> element in the document, and xpath returns them as a list of element objects.

  5. Print or manipulate the extracted metadata:
for meta in metadata:
    print(meta.attrs)

The attrs property of each <meta> element is a dictionary of that tag's attributes, such as name, content, or charset.

  6. Close the session to free up resources:
session.close()
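
The steps above can be condensed into a small helper. The following is a minimal sketch rather than a pattern prescribed by the library: meta_to_dict and fetch_meta are hypothetical names introduced here, and keying each tag on its name attribute (falling back to property, which Open Graph tags use) is an assumption about which tags matter. The requests_html import sits inside fetch_meta so that the pure helper can be tried on hand-written data without the library or a network connection.

```python
def meta_to_dict(attr_dicts):
    """Collapse <meta> attribute dicts into a {key: content} lookup.

    The key is the tag's `name`, falling back to `property`.
    Tags that lack a key or a `content` attribute are skipped.
    """
    result = {}
    for attrs in attr_dicts:
        key = attrs.get("name") or attrs.get("property")
        content = attrs.get("content")
        if key and content is not None:
            result[key] = content
    return result


def fetch_meta(url):
    """Fetch a page and return its metadata as a dict (hypothetical helper)."""
    from requests_html import HTMLSession  # lazy import; calling this needs network access

    # HTMLSession builds on requests.Session, so `with` closes it on exit
    with HTMLSession() as session:
        response = session.get(url)
        return meta_to_dict(m.attrs for m in response.html.xpath("//meta"))


# Hand-written sample standing in for the attrs of real <meta> elements
sample = [
    {"charset": "utf-8"},
    {"name": "description", "content": "An example page"},
    {"property": "og:title", "content": "Example"},
]
print(meta_to_dict(sample))
# {'description': 'An example page', 'og:title': 'Example'}
```

Using the session as a context manager replaces the explicit session.close() call and still closes the session if the request raises an exception.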

Example

Let’s put it all together in an example that extracts metadata from the Google homepage:

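from requests_html import HTMLSession

# Open a session and fetch the page
session = HTMLSession()
url = "https://www.google.com"
response = session.get(url)

# Select every <meta> element in the document
metadata = response.html.xpath('//meta')

# Print the attribute dictionary of each element
for meta in metadata:
    print(meta.attrs)

# Release the session's resources
session.close()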

When you run this code, it prints one dictionary of attributes for each <meta> element. The exact output depends on what the page serves at the time of the request.
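
In practice you often want one specific value rather than every dictionary. The sketch below shows one way to look up a field with fallbacks, for example preferring description and falling back to og:description. The function name first_content and the sample data are hypothetical; the sample stands in for the attribute dictionaries a real page would produce.

```python
def first_content(attr_dicts, *keys):
    """Return the content of the first <meta> tag whose name or property
    matches one of `keys`, trying the keys in order; None if none match."""
    for key in keys:
        for attrs in attr_dicts:
            if key in (attrs.get("name"), attrs.get("property")):
                content = attrs.get("content")
                if content is not None:
                    return content
    return None


# Hand-written sample standing in for [meta.attrs for meta in metadata]
sample_attrs = [
    {"charset": "UTF-8"},
    {"property": "og:description", "content": "Shared summary"},
    {"name": "robots", "content": "noindex"},
]

print(first_content(sample_attrs, "description", "og:description"))  # Shared summary
print(first_content(sample_attrs, "author"))  # None
```

With a live response, the same call would take [meta.attrs for meta in metadata] as its first argument.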

Conclusion

With requests-html, extracting metadata from an HTML document takes only a few lines of Python: open a session, fetch the page, select the <meta> elements with XPath, and read each element's attrs. Happy scraping!