# Web Scraping with Python

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)

[https://feng.li/python](https://feng.li/python)

# What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.


In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security. Because the scope of the field is so broad, this book covers the fundamentals of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers study the first part carefully and dip into the more specialized topics of the second part as needed.

# Your First Web Scraper

## Let's try a toy example first

In [1]:
from urllib.request import urlopen
html = urlopen('https://feng.li/python/')
print(html.read())

b'<!doctype html>\n<html lang="en-US" class="respect-color-scheme-preference">\n<head>\n\t<meta charset="UTF-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<title>Python\xe7\xa8\x8b\xe5\xba\x8f\xe8\xae\xbe\xe8\xae\xa1 &#8211; Dr. Feng Li</title>\n<meta name=\'robots\' content=\'max-image-preview:large\' />\n<link rel=\'dns-prefetch\' href=\'//s.w.org\' />\n<link rel="alternate" type="application/rss+xml" title="Dr. Feng Li &raquo; Feed" href="https://feng.li/feed/" />\n<link rel="alternate" type="application/rss+xml" title="Dr. Feng Li &raquo; Comments Feed" href="https://feng.li/comments/feed/" />\n\t\t<script>\n\t\t\twindow._wpemojiSettings = {"baseUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/13.1.0\\/72x72\\/","ext":".png","svgUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/13.1.0\\/svg\\/","svgExt":".svg","source":{"concatemoji":"https:\\/\\/feng.li\\/wordpress\\/wp-includes\\/js\\/wp-emoji-release.min.js?ver=5.8.2"}};\n\t\t\t!function

The raw bytes above are hard to read. Parsing them with BeautifulSoup gives a much cleaner result.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://feng.li/python/')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs)

<!DOCTYPE html>

<html class="respect-color-scheme-preference" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Python程序设计 – Dr. Feng Li</title>
<meta content="max-image-preview:large" name="robots"/>
<link href="//s.w.org" rel="dns-prefetch"/>
<link href="https://feng.li/feed/" rel="alternate" title="Dr. Feng Li » Feed" type="application/rss+xml"/>
<link href="https://feng.li/comments/feed/" rel="alternate" title="Dr. Feng Li » Comments Feed" type="application/rss+xml"/>
<script>
			window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.1.0\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.1.0\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/feng.li\/wordpress\/wp-includes\/js\/wp-emoji-release.min.js?ver=5.8.2"}};
			!function(e,a,t){var n,r,o,i=a.createElement("canvas"),p=i.getContext&&i.getContext("2d");function s(e,t){var a=String.fromCharCode;p.cl
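
Part of the reason the first cell's output is hard to read is that `read()` returns bytes, so non-ASCII characters show up as `\x..` escapes. As a minimal sketch, the snippet below decodes a short fragment of the `<title>` tag taken from that output; the fragment is shortened for illustration, and UTF-8 is assumed because the page declares `charset="UTF-8"`.

```python
# A shortened fragment of the <title> tag from the raw bytes output.
raw = b'<title>Python\xe7\xa8\x8b\xe5\xba\x8f\xe8\xae\xbe\xe8\xae\xa1</title>'

# Decoding as UTF-8 turns the byte escapes back into readable text.
text = raw.decode('utf-8')
print(text)
```

BeautifulSoup performs this decoding step itself when it is handed bytes, which is why the parsed output shows the Chinese title directly.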

## The complete case

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://feng.li/python/')
bs = BeautifulSoup(html.read(), 'html.parser')
nameList = bs.find_all('div', {'class':'entry-content'})
for name in nameList:
    print(name.get_text())




Contents1 课程简介2 授课教师3 参考书4 讲课视频5 幻灯片
课程简介
Python程序设计是面向财经和统计专业学生开设的一门以应用为主的编程课程，该课程最早由李丰老师在中央财经大学以公开讲座的形式开设，后成为中央财经大学金融、会计和MBA项目的核心课程。
授课教师


李丰博士现任中央财经大学统计与数学学院副院长、副教授、硕士生导师。博士毕业于瑞典斯德哥尔摩大学，研究领域包括贝叶斯统计学，预测方法，大数据分布式学习等。曾获瑞典皇家统计学会 Cramér 奖，国际贝叶斯学会青年奖励基金， 第二届全国高校经管类实验教学案例大赛二等奖。主持和参与多项国家自然科学基金项目。
李丰博士最新研究成果发表在统计期刊 Journal of Computational and Graphical Statistics，Journal of Business and Economic Statistics, Statistical Analysis and Data Mining，经济与管理学期刊 International Journal of Forecasting，Journal of Business Research，运筹学期刊European Journal of Operational Research, Journal of the Operational Research Society，人工智能期刊 Expert Systems with Applications，医学期刊 BMJ Open, Journal of Surgical Research, Journal of Affective Disorders等。同时著有 Bayesian Modeling of Conditional Densities，《大数据分布式计算与案例》和《统计计算》。


参考书
Python可以被广泛地使用在财经领域，以下列出一些零基础书目。
类别书名中译本数据分析Python for Data Analysis (by Wes McKinney)利用Python进行数据分析（原书第2版）数据抓取Web Scraping with Python: Collecting More Data from the Modern Web (by Ryan Mitchel
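
To see what the tag and class filter does without depending on a live website, here is a small offline sketch. The HTML string is made up for illustration and is not the real markup of the page above; only the `entry-content` class name is borrowed from it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real site markup).
sample_html = """
<html><body>
  <div class="sidebar">Menu</div>
  <div class="entry-content"><h2>Course</h2><p>Python for finance.</p></div>
</body></html>
"""

bs = BeautifulSoup(sample_html, 'html.parser')

# Keep only the <div> tags whose class is "entry-content".
matches = bs.find_all('div', {'class': 'entry-content'})

# Join the text of each match with single spaces between its pieces.
texts = [div.get_text(' ', strip=True) for div in matches]
print(texts)
```

The sidebar `<div>` is skipped because its class does not match, so only the main content is returned.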

## Web Scraping with `BeautifulSoup`

Let's start with this page:

https://finance.eastmoney.com/a/cgnjj_1.html
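
The `_1` in this URL looks like a page index. As a sketch, and assuming the same `cgnjj_<n>.html` pattern holds for later pages (this is not verified here), the URLs for several list pages could be generated as follows. The helper name `list_page_url` is hypothetical.

```python
def list_page_url(page):
    """Build the URL of one list page (assumes the cgnjj_<n>.html pattern)."""
    return 'https://finance.eastmoney.com/a/cgnjj_{}.html'.format(page)


# URLs for the first three list pages; the range is an arbitrary choice.
urls = [list_page_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```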

In [1]:
import requests

from bs4 import BeautifulSoup

page = 1

# Set a User-Agent header so the request looks like it comes from a regular browser
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'}

href = 'https://finance.eastmoney.com/a/cgnjj_%s.html' % page
html = requests.get(href, headers=headers)

In [2]:
# Check the request headers
html.request.headers

{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [3]:
# Check the HTTP status code
html.status_code

200

In [4]:
# Parsing html
soup = BeautifulSoup(html.content, 'html.parser')
soup


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!--published at 2022/1/18 9:59:11 by www.eastmoney.com ZP NEWS 51-->
<html>
<head>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="webkit" name="renderer"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>国内经济 _ 东方财富网</title>
<link href="Style/Layout?v=D-Lx7AIA9yzpTaxrfzY510uaHfFcF-f4wjxjND2x6AM1" rel="stylesheet" type="text/css"/>
<link href="Style/Module/ModuleStyle?v=Hg8E__Husi8eCaIDYam-IlV9PhIrIxKthXakK1TZGko1" rel="stylesheet" type="text/css"/>
<link href="Style/List?v=eGUa4FK6efrUwjDP3ziCyzxGwcRa659KOU2VShvbQco1" rel="stylesheet" type="text/css"/>
<link href="favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<base target="_blank"/>
</head>
<body style="margin-top:43px">
<div style="background-color:#fff;width:1000px;margin:0 auto;">
<img id="weixin-share" src="//cmsjs.eastmoney.com/common/weixi

In [5]:
divs = soup.find_all('ul', {"id": "newsListContent"})
divs

[<ul id="newsListContent">
 <li id="newsTr0">
 <!--文-->
 <div class="text text-no-img">
 <p class="title">
 <a href="http://finance.eastmoney.com/a/202201182251398069.html" target="_blank">
                     上海市发展和改革委员会副主任王华杰：上海要适度超前开展基础设施投资
                 </a>
 </p>
 <p class="info" title="上海市政府1月18日举行新闻发布会，上海市发展和改革委员会副主任王华杰介绍了上海适度超前开展基础设施投资的有关情况。王华杰表示，适度超前开展基础设施投资是扩大有效投资的一项重要内容，也是强化既有经济优势、壮大未来发展动能的内在要求。目前，上海适度超前开展基础设施投资已有良好的工作基础。2021年1月至11月，上海市城市基础设施增速9.3%，两年平均增速高于全国平均水平。">
                     上海市政府1月18日举行新闻发布会，上海市发展和改革委员会副主任王华杰介绍了上海适度超前开展基础设施投资的有关情况。王华杰表示，适度超前开展基础设施投资是扩大有效投资的一项重要内容，也是强化既有经济优...
                 </p>
 <p class="time">
                 01月18日 09:50
             </p>
 </div>
 <!--分享-->
 <!--
         <div class="share">
             <span class="shareIco"></span>
             <div class="shareContent">
                 <div id='bdshare' class="bdshare_t bds_tools get-codes-bdshare" data="{'url':'http://finance.eastmoney.com/a/202201182251398069.html',text:'上海市发展和

In [6]:
divs = soup.find_all('div', {"class": "text text-no-img"})
divs

[<div class="text text-no-img">
 <p class="title">
 <a href="http://finance.eastmoney.com/a/202201182251398069.html" target="_blank">
                     上海市发展和改革委员会副主任王华杰：上海要适度超前开展基础设施投资
                 </a>
 </p>
 <p class="info" title="上海市政府1月18日举行新闻发布会，上海市发展和改革委员会副主任王华杰介绍了上海适度超前开展基础设施投资的有关情况。王华杰表示，适度超前开展基础设施投资是扩大有效投资的一项重要内容，也是强化既有经济优势、壮大未来发展动能的内在要求。目前，上海适度超前开展基础设施投资已有良好的工作基础。2021年1月至11月，上海市城市基础设施增速9.3%，两年平均增速高于全国平均水平。">
                     上海市政府1月18日举行新闻发布会，上海市发展和改革委员会副主任王华杰介绍了上海适度超前开展基础设施投资的有关情况。王华杰表示，适度超前开展基础设施投资是扩大有效投资的一项重要内容，也是强化既有经济优...
                 </p>
 <p class="time">
                 01月18日 09:50
             </p>
 </div>,
 <div class="text text-no-img">
 <p class="title">
 <a href="http://finance.eastmoney.com/a/202201182251394035.html" target="_blank">
                     央行1月18日开展1000亿元逆回购操作
                 </a>
 </p>
 <p class="info">
                     人民银行网站消息，为维护银行体系流动性合理充裕，2022年1月18日人民银行以利率招标方式开展了1000亿元逆回购操作。
                 </p>
 <p class="time">
    

In [7]:
import csv

# The with-block closes the file even if a row fails to parse;
# newline='' lets the csv module control the line endings
with open("data/newsData.csv", 'w', newline='') as newsData:
    csv_writer = csv.writer(newsData, delimiter="\001")
    for div in divs:
        # News title
        titleinfo = div.find('a')
        title = titleinfo.get_text().strip()
        # News url
        url = titleinfo['href']
        # News abstract
        abstract = div.find('p', {"class": "info"}).get_text().strip()
        # Publication time
        time = div.find('p', {"class": "time"}).get_text().strip()
        print([title, time, abstract, url])
        csv_writer.writerow([title, time, abstract, url])


['上海市发展和改革委员会副主任王华杰：上海要适度超前开展基础设施投资', '01月18日 09:50', '上海市政府1月18日举行新闻发布会，上海市发展和改革委员会副主任王华杰介绍了上海适度超前开展基础设施投资的有关情况。王华杰表示，适度超前开展基础设施投资是扩大有效投资的一项重要内容，也是强化既有经济优...', 'http://finance.eastmoney.com/a/202201182251398069.html']
['央行1月18日开展1000亿元逆回购操作', '01月18日 09:49', '人民银行网站消息，为维护银行体系流动性合理充裕，2022年1月18日人民银行以利率招标方式开展了1000亿元逆回购操作。', 'http://finance.eastmoney.com/a/202201182251394035.html']
['国家统计局：2021年信息传输、软件和信息技术服务业GDP比上年增长17.2%', '01月18日 09:46', '国家统计局1月18日消息，根据有关基础资料和国民经济核算方法，我国2021年四季度和全年国内生产总值(以下简称GDP)初步核算结果公布，去年四季度信息传输、软件和信息技术服务业GDP同比增长11.5%...', 'http://finance.eastmoney.com/a/202201182251381338.html']
['创业黑马牛文文：“专精特新+北交所”指明了“黄金之路”', '01月18日 09:42', '近日，由创业黑马主办的第14届创业家年会在线上举行，数十位国内投资人、产业专家、企业家和创业者在“云端”出席，以“聚焦专精特新、坚持重度垂直”为主题，共同探讨商业未来。多位嘉宾表示，广大中小企业正迎来...', 'http://finance.eastmoney.com/a/202201182251357983.html']
['上海市住房城乡建设管理委副主任朱剑豪：2022年上海市重大工程计划完成投资2000亿元以上', '01月18日 09:42', '上海市住房城乡建设管理委副主任朱剑豪1月18日在上海市政府新闻发布会上表示，多年来，根据上海市委、市政府安排，上海始终保持每年推进百余项市重大项目建设，涵盖科技产业、社会民生、生态文明、城市基础设施、...', 'http:/
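
The `"\001"` delimiter is chosen because that control character is very unlikely to appear in news text, whereas commas are common. The following sketch shows a write and read round trip with that delimiter; it uses an in-memory buffer and a made-up row instead of `data/newsData.csv`, purely so that it runs anywhere.

```python
import csv
import io

# One made-up row; note the comma inside the abstract field.
rows = [
    ['Example title', '01-18 09:50', 'An abstract, with a comma',
     'http://example.com/a/1.html'],
]

# Write to an in-memory buffer instead of data/newsData.csv.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter='\001')
writer.writerows(rows)

# Read the same text back with a matching delimiter.
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter='\001'))
print(restored)
```

Reading the file back with `csv.reader` and the same delimiter is the exact inverse of `csv.writer`, so fields come back unchanged.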

In [8]:
import csv
import requests

from bs4 import BeautifulSoup


def get_body(href):
    """Retrieve the text of a news article given its URL.

    Args:
        href: URL of the news article to be scraped.

    Returns:
        content: the article text, or an empty string if the
            expected container is missing from the page.

    """
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'}
    html = requests.get(href, headers=headers)
    soup = BeautifulSoup(html.content, 'html.parser')
    div = soup.find('div', {"id": "ContentBody"})
    if div is None:
        # The page layout differs from what we expect; nothing to extract
        return ''
    paras = div.find_all('p')
    content = ''
    for p in paras:
        ptext = p.get_text().strip().replace("\n", "")
        content += ptext
    return content



if __name__ == "__main__":
    # Get and print the content for each URL collected from the list page
    with open("data/newsData.csv", newline='') as f:
        for title, date, abstract, href in csv.reader(f, delimiter='\001'):
            # Print progress to the console
            print('Scraping ' + href)
            content = get_body(href)
            print('\001'.join([title, date, abstract, href, content]))

Scraping http://finance.eastmoney.com/a/202201182251398069.html
上海市发展和改革委员会副主任王华杰：上海要适度超前开展基础设施投资01月18日 09:50上海市政府1月18日举行新闻发布会，上海市发展和改革委员会副主任王华杰介绍了上海适度超前开展基础设施投资的有关情况。王华杰表示，适度超前开展基础设施投资是扩大有效投资的一项重要内容，也是强化既有经济优...http://finance.eastmoney.com/a/202201182251398069.html上海市政府1月18日举行新闻发布会，上海市发展和改革委员会副主任王华杰介绍了上海适度超前开展基础设施投资的有关情况。王华杰表示，适度超前开展基础设施投资是扩大有效投资的一项重要内容，也是强化既有经济优势、壮大未来发展动能的内在要求。目前，上海适度超前开展基础设施投资已有良好的工作基础。2021年1月至11月，上海市城市基础设施增速9.3%，两年平均增速高于全国平均水平。经过多年大规模投入，上海已经基本建成“枢纽型、功能性、网络化”基础设施体系。王华杰说，围绕中央经济工作会议提出的适度超前开展基础设施投资有关要求，上海市已做相关部署推进此项工作，主要包括三个方面：一是坚持规划引领。发挥上海市“十四五”各专项规划发布早、项目储备足的优势，持续推进提升交通、能源、生态环境等传统基础设施能级，提高数字化、智能化、绿色低碳水平，同步加快布局服务新业态、培育新动能、支撑新赛道的新型基础设施。二是坚持远近结合。针对短板弱项，精准发力，积极推进重点领域和薄弱环节项目建设，既扩大当前有效投资，补短板、强弱项，又增强发展后劲。三是坚持“时、度、效”相统一。聚焦关系城市经济社会发展，早晚都要干的重大项目，经科学论证后，提前到“十四五”前期实施。王华杰表示，最新出炉的《2022年上海市扩大有效投资稳定经济发展的若干政策措施》(以下简称《政策措施》)已对适度超前开展基础设施投资有了较为系统的筹划。在项目安排上，《政策措施》提出重点支持国家重大战略全面落实，助力城市基础设施提质增效，支撑韧性城市建设。在资金和要素保障上，《政策措施》提出加强市级建设财力、土地出让收入、政府债券等资金统筹，加强地、水、绿、林、土、房等资源性指标保障，确保要素跟着项目走。（文章来源：上

600员工居“嘉”过年 “报喜鸟”百万补助留温人员01月18日 09:3160多桌除夕分岁宴席、200元压岁红包、职业生活技能培训、免费用餐和住宿……临近过年，报喜鸟公司开出了丰富的“留温过年”福利清单，吸引了企业近600名员工留在永嘉过年。当前部分城市疫情防控严峻，为助推...http://finance.eastmoney.com/a/202201182251311760.html60多桌除夕分岁宴席、200元压岁红包、职业生活技能培训、免费用餐和住宿……临近过年，报喜鸟公司开出了丰富的“留温过年”福利清单，吸引了企业近600名员工留在永嘉过年。当前部分城市疫情防控严峻，为助推永嘉经济“开门稳”，报喜鸟在持续推动自身生产经营工作、做好防疫防控工作之外，积极履行企业社会责任，倡导员工留温过年——通过多项举措，鼓励外来员工安“家”不离“嘉”。春节期间，公司计划投入100多万元补助留温员工，涵盖吃、住、培训、文体娱乐等多个方面。除夕当天，公司会设立分岁宴席，邀请全体留温员工及家属一同参加。公司还为留温员工开展职业技能和生活技能培训。和报喜鸟相同，永嘉县其他企业也用实招留人。纽顿流体科技有限公司除给予员工适当的现金奖励外，他们可在春节之后享受7天的探亲假；永嘉县国明橡塑有限公司规定在正月初十之前报到的，可以得到1300元补贴，另外还有700元的带薪休假及岁末慰问等多重福利。此外，永嘉县政府也在持续为新春大礼包“加码”。在留岗补助上，永嘉在全市标准外，新增20人以上企业补助2万元的举措。对于总投资超过1亿元的重大项目企业，春节期间外地留温员工人数50人以上、100人以上的，分别给予5万元、10万元补助。（文章来源：温州日报）
Scraping http://finance.eastmoney.com/a/202201182251219557.html
2021年汽车制造业工业增加值同比稳定增长01月18日 09:26中汽协发布数据，2021年，汽车产销双双超过2600万辆，结束了自2018年以来连续三年下降的局面。汽车制造业工业增加值同比也保持稳定增长，且增速高于同期产销。http://finance.eastmoney.com/a/202201182251219557.html中汽协发布数据，2021年，汽车产销双双超过2600万辆，结束了自2018年以
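
Article pages do not always share one layout, so the lookup of the `ContentBody` container can come back empty. The sketch below isolates the parsing step so it can be tried on HTML strings without any network access. The function name `extract_body` and both sample pages are hypothetical; only the `ContentBody` id comes from the code above.

```python
from bs4 import BeautifulSoup


def extract_body(html_text):
    """Join the <p> text inside div#ContentBody, or return '' if it is absent."""
    soup = BeautifulSoup(html_text, 'html.parser')
    div = soup.find('div', {'id': 'ContentBody'})
    if div is None:
        return ''
    return ''.join(p.get_text().strip().replace('\n', '')
                   for p in div.find_all('p'))


# Hypothetical pages: one with the expected container, one without.
with_body = '<div id="ContentBody"><p> First. </p><p>Second.</p></div>'
without_body = '<div id="Other"><p>Ignored.</p></div>'

body_found = extract_body(with_body)
body_missing = extract_body(without_body)
print(repr(body_found))
print(repr(body_missing))
```

Separating parsing from downloading in this way also makes the extraction logic easy to test on saved pages.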

## Web Crawling with `Scrapy`*

One of the challenges of writing web crawlers is that you’re often performing the same tasks again and again: find all links on a page, evaluate the difference between internal and external links, go to new pages. These basic patterns are useful to know and to be able to write from scratch, but the Scrapy library handles many of these details for you.

###  Installing Scrapy

- After Anaconda is installed, you can install Scrapy by using this command:
   
      conda install -c conda-forge scrapy

### Dealing with Different Website Layouts

Fortunately, in most cases of web crawling, you’re not looking to collect data from sites you’ve never seen before, but from a few, or a few dozen, websites that are pre-selected by a human. This means that you don’t need to use complicated algorithms or machine learning to detect which text on the page “looks most like a title” or which is probably the “main content.” You can determine what these elements are manually.

The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.
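
A minimal sketch of this idea follows. The two "sites", their markup, the `Content` class, and the parser function names are all invented for illustration; the point is only that each parser knows one layout and all of them return the same kind of object.

```python
from dataclasses import dataclass

from bs4 import BeautifulSoup


@dataclass
class Content:
    """Common container for whatever a site-specific parser extracts."""
    url: str
    title: str
    body: str


def parse_site_a(url, bs):
    """Hypothetical site A: title in <h1>, body in <div class="article">."""
    title = bs.find('h1').get_text(strip=True)
    body = bs.find('div', {'class': 'article'}).get_text(strip=True)
    return Content(url, title, body)


def parse_site_b(url, bs):
    """Hypothetical site B: title in <title>, body in <p id="story">."""
    title = bs.find('title').get_text(strip=True)
    body = bs.find('p', {'id': 'story'}).get_text(strip=True)
    return Content(url, title, body)


# Made-up pages standing in for two differently structured websites.
page_a = '<h1>Title A</h1><div class="article"><p>Body A.</p></div>'
page_b = ('<html><head><title>Title B</title></head>'
          '<body><p id="story">Body B.</p></body></html>')

result_a = parse_site_a('http://a.example/1', BeautifulSoup(page_a, 'html.parser'))
result_b = parse_site_b('http://b.example/1', BeautifulSoup(page_b, 'html.parser'))
print(result_a)
print(result_b)
```

Because both parsers return a `Content` object, the code that stores or analyses the results does not need to know which site a page came from.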


## Initializing a New Spider

To create a new Scrapy project in the current directory, run the following from the **command line (NOT THE PYTHON PROMPT)**:
```
scrapy startproject wikiSpider
```

This creates a new subdirectory named wikiSpider in the current directory. Inside this directory is the following file structure:

- scrapy.cfg
- wikiSpider
  - spiders
     - \_\_init\_\_.py
  - items.py
  - middlewares.py
  - pipelines.py
  - settings.py
  - \_\_init\_\_.py
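
Among these files, `settings.py` holds the project-wide crawl settings. As a rough sketch, a polite configuration might look like the fragment below. The setting names are standard Scrapy settings, but every value and the contact URL are illustrative assumptions rather than values taken from this course.

```python
# wikiSpider/wikiSpider/settings.py (fragment; all values are illustrative assumptions)

BOT_NAME = 'wikiSpider'

# Respect the rules each site publishes in robots.txt.
ROBOTSTXT_OBEY = True

# Wait between consecutive requests to the same site (seconds).
DOWNLOAD_DELAY = 1.0

# Limit the number of parallel requests per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Identify the crawler; the contact URL is a placeholder.
USER_AGENT = 'wikiSpider (+https://example.com/contact)'
```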

### Generate some spiders with templates from the command line

    scrapy genspider example example.com 
    scrapy genspider example2 example.com 
    scrapy genspider example3 example2.com 

### Writing a Simple Scraper

To create a crawler, you will add a new file inside the spiders directory at wikiSpider/wikiSpider/spiders/article.py. In your newly created **article.py** file, write the following:

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
```

### Run this article spider

You can run this article spider by navigating to the wikiSpider/wikiSpider/spiders directory, where article.py was saved, and running from the command line:

    scrapy runspider article.py
        
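### Run a spider from the project root directory

From the project root (the directory containing scrapy.cfg), `scrapy crawl` runs a spider by its `name` attribute. The command below assumes a spider named `table`; `-o` writes the scraped items to a CSV file and `--logfile` sends the log to a file:

    scrapy crawl table -o table.csv  --logfile table.log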
    

### Scrapy Shell

To explore a page interactively, launch the Scrapy shell from the command line:

```bash
scrapy shell "http://en.wikipedia.org/wiki/Python_%28programming_language%29"
```

# Lab 

Use the `scrapy` framework to implement the scraper we studied with `BeautifulSoup`.