Recently, for work reasons, I started studying web crawlers on my own to scrape data from the internet. After a period of exploration I have stepped into quite a few pitfalls and also learned a lot that I did not know before. I am summarizing that experience here and sharing it, in the hope that it helps anyone who needs it; and of course, if any crawler experts are willing to offer pointers, all the better.

Most people implement crawlers in Python, but since my background is Java and I do not have spare time to pick up a new language, I chose to implement mine in Java. This post shares a Java implementation for crawling Baidu search results. The Java code is as follows:

```java
import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HostConfiguration;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

import java.io.IOException;

public class Main {
    // Proxy server (product site: www.16yun.cn)
    private static final String PROXY_HOST = "t.16yun.cn";
    private static final int PROXY_PORT = 31111;

    public static void main(String[] args) {
        HttpClient client = new HttpClient();
        HttpMethod method = new GetMethod("https://httpbin.org/ip");

        // Route all requests through the proxy server
        HostConfiguration config = client.getHostConfiguration();
        config.setProxy(PROXY_HOST, PROXY_PORT);

        // Send the proxy credentials preemptively with the first request
        client.getParams().setAuthenticationPreemptive(true);
        String username = "16ABCCKJ";
        String password = "712323";
        Credentials credentials = new UsernamePasswordCredentials(username, password);
        AuthScope authScope = new AuthScope(PROXY_HOST, PROXY_PORT);
        client.getState().setProxyCredentials(authScope, credentials);

        try {
            client.executeMethod(method);
            if (method.getStatusCode() == HttpStatus.SC_OK) {
                String response = method.getResponseBodyAsString();
                System.out.println("Response = " + response);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the connection back to the connection manager
            method.releaseConnection();
        }
    }
}
```
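The snippet above only verifies that the proxy works by requesting httpbin.org/ip. Below is a minimal sketch of how the same setup could be pointed at a Baidu search results page and the result titles extracted with Jsoup. The `ProxyConfig.buildClient()` helper, the `wd` query parameter, the User-Agent string, and the `h3 a` selector are illustrative assumptions (Baidu's markup changes over time), not part of the original code.

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.net.URLEncoder;

public class BaiduSearchDemo {
    public static void main(String[] args) throws Exception {
        // Assumed helper: returns an HttpClient configured with the proxy host,
        // port, and credentials exactly as shown in the snippet above.
        HttpClient client = ProxyConfig.buildClient();

        // Build the search URL; "wd" is Baidu's keyword parameter
        String keyword = URLEncoder.encode("java 爬虫", "UTF-8");
        GetMethod method = new GetMethod("https://www.baidu.com/s?wd=" + keyword);
        // A browser-like User-Agent makes the request less likely to be rejected
        method.setRequestHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        try {
            client.executeMethod(method);
            if (method.getStatusCode() == HttpStatus.SC_OK) {
                String html = method.getResponseBodyAsString();
                // Parse the result page; "h3 a" is an assumed selector for result
                // titles and may need adjusting to Baidu's current markup.
                Document doc = Jsoup.parse(html);
                for (Element link : doc.select("h3 a")) {
                    System.out.println(link.text() + " -> " + link.attr("href"));
                }
            }
        } finally {
            method.releaseConnection();
        }
    }
}
```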
While learning, I also ran into the anti-crawling mechanisms of various sites, such as User-Agent checks, limits on the number of requests per IP, and captchas. Some of these are fairly easy to get around and some are much harder. Per-IP rate limiting, for example, can be solved directly by purchasing high-quality proxy IPs, such as the 亿牛云 crawler proxy used in the example above. The harder mechanisms take deeper study to overcome.
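As a concrete illustration of the first two countermeasures, here is a small sketch that rotates through a pool of browser-like User-Agent strings and inserts a random delay between consecutive requests. The `fetch()` helper and the 2-5 second delay range are placeholder assumptions for illustration; captcha handling is out of scope here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class PoliteFetcher {
    // A small pool of browser-like User-Agent strings to rotate through
    private static final List<String> USER_AGENTS = Arrays.asList(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36");

    private static final Random RANDOM = new Random();

    public static void crawl(List<String> urls) throws InterruptedException {
        for (String url : urls) {
            // Pick a User-Agent at random for each request
            String userAgent = USER_AGENTS.get(RANDOM.nextInt(USER_AGENTS.size()));
            fetch(url, userAgent);

            // Sleep 2-5 seconds between requests to stay under per-IP rate
            // limits (the exact range is an assumption; tune it per target site)
            TimeUnit.MILLISECONDS.sleep(2000 + RANDOM.nextInt(3000));
        }
    }

    // Placeholder: issue the request with the proxy-configured HttpClient
    // shown earlier, setting the given User-Agent header on the GetMethod.
    private static void fetch(String url, String userAgent) {
        System.out.println("GET " + url + " with UA: " + userAgent);
    }
}
```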