Simple Java Crawler Code - 白红宇

Published: 2019-05-04


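This post demonstrates only the simplest possible crawler.

Setup: you need to add one jar to your project.

(jsoup is a Java library for fetching and parsing HTML that is often used for crawlers; other options include JSpider, HtmlUnit, and Jaunt.)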

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SimpleSpider {
    // Fetch the page at the given URL and parse it into a jsoup Document.
    Document getDoc(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        return doc;
    }

    // cssQuery syntax: https://jsoup.org/apidocs/org/jsoup/select/Selector.html
    Elements getElementAs(Document doc) {
        Elements a = doc.select("a[href]"); // finds links (a tags with href attributes)
        return a;
    }
}

public class SimpleOne {
    public static void main(String[] args) throws IOException {
        SimpleSpider s = new SimpleSpider();
        Document doc = s.getDoc("https://www.baidu.com/");
        // System.out.println(s.getElementAs(doc));
        Elements aSet = s.getElementAs(doc);
        for (Element i : aSet) {
            System.out.println(i.attr("href")); // get attr href
        }
        System.out.println("end");
    }
}

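I have only just started with this myself, but I have found that with these frameworks a Java crawler does not actually look that verbose.

Parsing the page is not complicated either.

But if you do not use one of these frameworks...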

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WithNoExtraJar {
    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader reader;
        String line;
        try {
            url = new URL("https://www.baidu.com/");
            is = url.openStream(); // get connection
            // DataInputStream.readLine() is deprecated, so read text through a BufferedReader.
            // Assumes the page is UTF-8 encoded.
            reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                // openStream() may have failed, in which case is is still null
                if (is != null) {
                    is.close();
                }
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}

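The program above only fetches the page source. It has not parsed anything yet, and it is already a big pile of code that gives me a headache just looking at it.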
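To give a rough idea of what the missing parsing step might look like without a library, here is a minimal sketch that pulls href values out of an HTML string with a regular expression. The class name RegexLinkExtractor, the method extractHrefs, and the sample HTML are all hypothetical and not from the original code. A regex like this is only an approximation (it ignores unquoted attributes, comments, scripts, and so on), which is part of why a real parser is attractive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the original post.
class RegexLinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
    // Rough approximation: unquoted attribute values and edge cases are not handled.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw href values in the order they appear in the HTML.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample input so the sketch runs without network access.
        String html = "<p><a href=\"https://example.com/a\">A</a> <A class='x' HREF='/b'>B</A></p>";
        for (String link : extractHrefs(html)) {
            System.out.println(link);
        }
    }
}
```

Even this small piece only covers one selector, whereas the jsoup version gets the same links from a single doc.select("a[href]") call.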

So it is still better to use a framework.

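Besides Java, quite a few other languages can also be used to write crawlers.

For example:

Ruby, Node.js, Python, C++, PHP

Each has its own strengths, but the most popular one right now is Python.

As for Java, I personally think its multithreading is pretty decent.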

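(I searched both Google and Baidu and really could not find anyone evaluating Java for crawlers. Heh, it seems to have been deliberately ignored.)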
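On the multithreading point above, here is a minimal sketch of how several pages might be processed in parallel with a thread pool from java.util.concurrent. The class name ConcurrentFetcher, the method fetchAll, and the stand-in fetcher are hypothetical. The fetcher is passed in as a function so the sketch runs offline; in a real crawler it would presumably wrap something like the getDoc call shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch for illustration only; not part of the original post.
class ConcurrentFetcher {
    // Applies fetcher to every URL on a fixed-size thread pool
    // and returns the results in the same order as the input list.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and keeps the task order
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher so the sketch runs offline; a real one would download the page.
        Function<String, String> fakeFetch = url -> "fetched " + url;
        List<String> urls = List.of("https://example.com/1", "https://example.com/2");
        for (String result : fetchAll(urls, fakeFetch, 2)) {
            System.out.println(result);
        }
    }
}
```

A fixed-size pool keeps the number of simultaneous requests bounded, which seems like a sensible default when hitting someone else's server.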

Reposted from: http://moiyb.baihongyu.com/
