In web scraping, a page’s metadata can tell you about its content, authorship, and other details declared in its <meta> tags. With the requests-html library in Python, extracting that metadata from an HTML document takes only a few lines of code.
Installing requests-html
Before diving into the code, make sure you have the requests-html library installed. You can install it using pip:
pip install requests-html
Extracting Metadata using requests-html
To extract metadata from an HTML document using requests-html, follow these steps:
- Import the necessary modules:
from requests_html import HTMLSession
- Create an instance of the HTMLSession class:
session = HTMLSession()
- Make a GET request to the desired URL:
url = "https://example.com" # Replace with the actual URL
response = session.get(url)
- Extract the metadata from the HTML document:
metadata = response.html.xpath('//meta')
Here, we use xpath to select all <meta> elements present in the HTML document.
- Print or manipulate the extracted metadata:
for meta in metadata:
    print(meta.attrs)
The attrs property of each <meta> element returns that tag's attributes as a dictionary, for example its name and content values.
- Close the session to free up resources:
session.close()
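The per-tag dictionaries printed in the steps above are often easier to work with once they are collapsed into a single mapping. The following is a minimal sketch of one way to do that. The names `summarize_meta` and `sample` are hypothetical and introduced only for illustration; `sample` stands in for `[meta.attrs for meta in metadata]` so the snippet runs without a network request.

```python
def summarize_meta(attr_dicts):
    """Collapse per-tag attribute dictionaries into one key -> content mapping."""
    summary = {}
    for attrs in attr_dicts:
        # <meta charset="..."> carries its value in the charset attribute itself
        if "charset" in attrs:
            summary["charset"] = attrs["charset"]
            continue
        # Standard tags use name=, Open Graph tags use property=,
        # and header-like tags use http-equiv=
        key = attrs.get("name") or attrs.get("property") or attrs.get("http-equiv")
        # Skip tags that have no usable key or no content value
        if key and "content" in attrs:
            summary[key.lower()] = attrs["content"]
    return summary


# Hypothetical stand-in for [meta.attrs for meta in metadata]
sample = [
    {"charset": "utf-8"},
    {"name": "description", "content": "An example page"},
    {"property": "og:title", "content": "Example"},
    {"name": "viewport"},  # no content attribute, so it is skipped
]

print(summarize_meta(sample))
# → {'charset': 'utf-8', 'description': 'An example page', 'og:title': 'Example'}
```

If two tags share the same key, this sketch keeps the last one; collecting values into lists would be a reasonable alternative when duplicates matter.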
Example
Let’s put it all together in an example that extracts metadata from the Google homepage:
from requests_html import HTMLSession
session = HTMLSession()
url = "https://www.google.com"
response = session.get(url)
# Select every <meta> element in the page
metadata = response.html.xpath('//meta')
# Print each element's attributes as a dictionary
for meta in metadata:
    print(meta.attrs)
session.close()
When you run this code, the attributes of each <meta> element are printed to the console as a dictionary, one element per line.
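For anything beyond a quick experiment, it may help to wrap the request in a small function that sets a timeout and checks the HTTP status before parsing. The sketch below is one possible shape, not part of requests-html: `fetch_meta_attrs` is a hypothetical helper, and the 10-second default timeout is an arbitrary assumption. It relies only on `session.get`, `response.raise_for_status`, and `response.html.xpath`.

```python
def fetch_meta_attrs(session, url, timeout=10):
    """Return the attribute dictionary of every <meta> element at `url`.

    `session` is expected to behave like an HTMLSession: its get() must
    return a response exposing raise_for_status() and an html.xpath() method.
    """
    response = session.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return [meta.attrs for meta in response.html.xpath("//meta")]


# Usage sketch (needs network access and requests-html installed).
# HTMLSession builds on requests.Session, so it should also work as a
# context manager that closes the session on exit:
#
#     from requests_html import HTMLSession
#     with HTMLSession() as session:
#         for attrs in fetch_meta_attrs(session, "https://example.com"):
#             print(attrs)
```

Passing the session in as an argument keeps the helper easy to reuse across many URLs and easy to exercise with a stand-in object when no network is available.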
Conclusion
With the requests-html library, extracting metadata from HTML documents is a straightforward task in Python. By following the steps outlined above, you can retrieve the attributes of every <meta> element on a page and use them in your own scraping projects. Happy scraping!