[java] HttpClient를 사용하여 웹 사이트의 모든 링크를 추출하는 방법은?

16 Nov 2023

java

이 문서에서는 Java의 HttpClient를 사용하여 웹 사이트에서 모든 링크를 추출하는 방법에 대해 알아보겠습니다.

HttpClient 라이브러리 추가

먼저, HttpClient를 사용하기 위해 Maven 또는 Gradle 등을 통해 HttpClient 라이브러리를 프로젝트에 추가해야 합니다. 아래는 Maven을 사용하는 경우의 의존성 설정 예시입니다.

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
    </dependency>
</dependencies>

모든 링크 추출하기

HttpClient를 사용하여 웹 사이트의 모든 링크를 추출하는 방법은 다음과 같습니다.

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebsiteLinkExtractor {
    public static Set<String> extractLinks(String url) throws IOException {
        HttpClient httpClient = HttpClientBuilder.create().build();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse response = httpClient.execute(httpGet);
        String html = EntityUtils.toString(response.getEntity());

        Pattern linkPattern = Pattern.compile("<a\\s+(?:[^>]*?\\s+)?href=([\"'])(.*?)\\1");
        Matcher matcher = linkPattern.matcher(html);

        Set<String> links = new HashSet<>();
        while (matcher.find()) {
            String link = matcher.group(2);
            if (!link.startsWith("http")) {
                links.add(url + link);
            } else {
                links.add(link);
            }
        }

        return links;
    }

    public static void main(String[] args) {
        String url = "http://example.com";
        try {
            Set<String> links = extractLinks(url);
            for (String link : links) {
                System.out.println(link);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

위의 코드는 주어진 URL에서 HTML을 가져와 정규식을 사용하여 모든 링크를 추출합니다. 추출된 링크는 Set에 저장되어 반환됩니다.

main 메서드에서는 웹 사이트의 URL을 입력하고, extractLinks 메서드를 호출하여 모든 링크를 추출한 후 출력합니다.

결론

HttpClient를 사용하여 Java에서 웹 사이트의 모든 링크를 추출하는 방법을 알아보았습니다. 이를 통해 웹 스크래핑이나 웹 크롤링과 같은 작업을 수행할 수 있습니다.