Today I looked into writing a web scraper. The requirements went roughly like this:

Using Python, complete the following tasks:

1) Scrape the Zhaopin (智联招聘, zhaopin.com) job site.

2) Limit the data to one specific job type (e.g. "数据分析师" / Data Analyst under the "互联网IT" / Internet-IT category). URL: https://www.zhaopin.com/sou/jl489/kwCLO66RII0PJP0NG8

3) Cover at least ten cities (100 records per city) and pull the useful fields from each job posting: job title, salary, work-experience requirement, education requirement, and so on.

4) Save the data to a CSV file.

1. First attempt

In the age of AI, not using AI-assisted programming would be a mistake. I fed these requirements to ChatGPT as a prompt and immediately got back an answer plus source code. The AI recommended using Selenium to grab the data, and its answer also listed the numeric codes for the various cities.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time

# City-code mapping (extend as needed)
city_dict = {
    "北京": "530", "上海": "538", "广州": "763", "深圳": "765",
    "杭州": "653", "成都": "801", "南京": "635", "天津": "531",
    "武汉": "736", "西安": "854"
}

# Initialize the Chrome browser
def init_browser():
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # headless mode
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(options=chrome_options)
    return driver

# Scrape the job listings for one city
def scrape_city_jobs(city_name, city_code, max_jobs=100):
    print(f"Scraping data-analyst jobs in {city_name}...")
    driver = init_browser()
    jobs = []
    page = 1
    while len(jobs) < max_jobs:
        url = f"https://www.zhaopin.com/sou/jl{city_code}/kw数据分析师/pn{page}"
        driver.get(url)
        time.sleep(3)

        job_cards = driver.find_elements(By.CSS_SELECTOR, ".joblist-box__item")
        if not job_cards:
            print(f"No more results for {city_name} on page {page}")
            break

        for card in job_cards:
            try:
                title = card.find_element(By.CSS_SELECTOR, ".job-name").text
                salary = card.find_element(By.CSS_SELECTOR, ".job-salary").text
                tags = card.find_elements(By.CSS_SELECTOR, ".job-info span")
                experience = tags[0].text if len(tags) > 0 else ""
                education = tags[1].text if len(tags) > 1 else ""

                jobs.append({
                    "城市": city_name,
                    "职位名称": title,
                    "薪酬待遇": salary,
                    "工作经验": experience,
                    "学历要求": education
                })
            except Exception as e:
                print(f"Failed to parse a card: {e}")
                continue

            if len(jobs) >= max_jobs:
                break
        page += 1

    driver.quit()
    return jobs

# Main flow
def main():
    all_jobs = []
    for city, code in city_dict.items():
        jobs = scrape_city_jobs(city, code, 100)
        all_jobs.extend(jobs)

    df = pd.DataFrame(all_jobs)
    df.to_csv("数据分析师职位信息.csv", index=False, encoding="utf-8-sig")
    print("Saved to 数据分析师职位信息.csv")

if __name__ == "__main__":
    main()

No such luck: of course the program wouldn't run; the find_elements call came back with nothing.

2. job_cards comes back empty

I pressed GPT on why no job listings were coming back, and it offered these ideas:

1) Most of Zhaopin's job data is loaded dynamically via JavaScript. Right after a plain driver.get(), the page has not yet rendered its full content, so find_elements returns an empty list.

2) It recommended using WebDriverWait to wait until the dynamic content has finished loading. I did not try this technique (a sketch of what it might look like follows this list).

3) It suggested using the API at https://fe-api.zhaopin.com. After trying it, I found that this API is outdated and has been blocked by Zhaopin.
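For reference, the untried WebDriverWait idea would look roughly like the sketch below. It is only a sketch based on the ".joblist-box__item" selector from the AI's original code; I never verified it against the live site.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Sketch only: wait up to 15 seconds for at least one job card to render
# before reading the list. The ".joblist-box__item" selector is taken from
# the AI's first answer and was not verified against the live page.
driver = webdriver.Chrome()
driver.get("https://www.zhaopin.com/sou/jl530/kw数据分析师/pn1")
try:
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".joblist-box__item"))
    )
    job_cards = driver.find_elements(By.CSS_SELECTOR, ".joblist-box__item")
    print(f"Found {len(job_cards)} job cards")
finally:
    driver.quit()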

3. Examining what the page actually returns

Looking at the HTML source, I found that every job listing appears inside a <div class="joblist-box__item clearfix joblist-box__item-unlogin"> element. I asked the AI how to locate it with find_elements(), and tried both approaches it suggested, but they still failed:

job_cards = driver.find_elements(By.CSS_SELECTOR, ".joblist-box__item")
job_cards = driver.find_elements(By.XPATH, '//div[contains(@class, "joblist-box__item")]')

I also tried turning off headless mode, but still got no usable result.

After several more rounds of questioning, the AI's answers pointed me to a different approach: Playwright.

4. Switching to Playwright

First the environment has to be installed, and the second command, playwright install, is easy to overlook. If the environment is not set up properly, the Python program fails at runtime, and the error message "It looks like you are using Playwright Sync API inside the asyncio loop. Please use the Async API instead." is rather baffling.

pip install playwright
playwright install
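As a quick sanity check of the installation (my own snippet, not part of the AI's answer), a minimal sync-API script will do; note that it must be run as a plain script, not inside a notebook or an existing asyncio loop, or you will hit exactly the error quoted above.

from playwright.sync_api import sync_playwright

# Minimal sanity check: launch Chromium, load a page, print its title.
# If "playwright install" was skipped, the launch step fails immediately.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # expected: "Example Domain"
    browser.close()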

The AI also produced a piece of async code, which of course did not work on the first try either.

import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import pandas as pd
import time

async def fetch_zhaopin_jobs():
    url = "https://www.zhaopin.com/sou/jl530/kw数据分析师"  # Beijing as an example

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # False is recommended while debugging
        page = await browser.new_page()

        await page.goto(url)
        await page.wait_for_selector('#positionList-hook', timeout=10000)

        # Scroll to the bottom to trigger lazy loading
        for i in range(10):
            await page.mouse.wheel(0, 3000)
            await asyncio.sleep(1.5)

        html = await page.content()
        await browser.close()

    # Parse the job cards with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    job_cards = soup.select('div.joblist-box__item')

    print(f"Extracted {len(job_cards)} job listings")

    jobs = []
    for card in job_cards:
        title = card.select_one(".jobinfo__name")
        salary = card.select_one(".jobinfo__salary")
        info_items = card.select(".jobinfo__other-info-item")

        job = {
            "职位名称": title.text.strip() if title else "",
            "薪酬待遇": salary.text.strip() if salary else "",
            "工作经验": info_items[1].text.strip() if len(info_items) > 1 else "",
            "学历要求": info_items[2].text.strip() if len(info_items) > 2 else ""
        }
        jobs.append(job)

    # Save to CSV
    df = pd.DataFrame(jobs)
    df.to_csv("智联招聘_数据分析师_playwright.csv", index=False, encoding="utf-8-sig")
    print("✅ Data saved to the CSV file.")

# Run the task
asyncio.run(fetch_zhaopin_jobs())

5. Re-analyzing the HTML page

The AI has no idea what the job-listing page actually contains, so it needs an explicit prompt. I pasted the HTML source returned in my browser straight into the question; the prompt went roughly like this:

Below is a piece of HTML (more content follows after it) containing some job postings. I want to write a Python + Playwright program that extracts the company name, job title, salary, and education requirement from this text and saves them to CSV. Here is the HTML:

 <body> 
    <div id="root" data-server-rendered="true" class="app">
    <div id="best-employer-top-banner"></div> 
    <div class="header-nav">
    <div class="header-nav__main">
    <a href="https://www.zhaopin.com" target="_blank" class="header-nav__logo-box">
    <img src="//fecdn4.zhaopin.cn/www_zhaopin_com/sou/assets/logo.d1e4f0.png" alt="logo" class="header-nav__logo"></a> 
    <ul class="header-nav__ul">
    <li class="header-nav__ul__item">
    <a href="https://www.zhaopin.com" target="_blank" class="header-nav__ul__item__a">
           <div class="filter-box__main">
           <div class="query-search">
           <div class="query-search__content">
           <div class="query-search__content-input__wrap"><input type="text" placeholder="输入职位、公司等搜索" value="数据分析师" class="query-search__content-input"> 
           ............ 10,000 characters omitted here ............
        <div class="companyinfo__staff-name">
          XX女士·招聘
        </div>

This time the AI really delivered and produced a program that actually worked. To be safe I ran it with headless=False (the AI's version below still shows headless=True; I only changed it in the final program).

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import pandas as pd

def scrape_job_info(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=60000)

        # Wait for the job list to load
        page.wait_for_selector('.positionlist__list-unlogin')
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, 'html.parser')
    job_items = soup.select('.joblist-box__item-unlogin')

    data = []

    for job in job_items:
        try:
            job_title = job.select_one('.jobinfo__name').get_text(strip=True)
            salary = job.select_one('.jobinfo__salary').get_text(strip=True)
            education = job.select('.jobinfo__other-info-item')[-1].get_text(strip=True)
            company = job.select_one('.companyinfo__name').get_text(strip=True)

            data.append({
                '职位名称': job_title,
                '薪酬待遇': salary,
                '学历要求': education,
                '公司名称': company
            })
        except Exception as e:
            print("Failed to parse a card:", e)

    return data

# Example URL (replace it with the page you actually visit)
url = 'https://sou.zhaopin.com/?kw=数据分析师&jl=530'

job_data = scrape_job_info(url)

# Save as CSV
df = pd.DataFrame(job_data)
df.to_csv('job_list.csv', index=False, encoding='utf-8-sig')
print("Saved as job_list.csv")

6. The final program

With one page of listings scraping successfully, all that remains is to loop over a handful of cities and fetch a few pages for each.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import csv


# First install Playwright with: pip install playwright
# Then run one more command: playwright install
# Only then run this Python program

def scrape_job_info(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page(viewport={'width': 300, 'height': 200})
        page.goto(url, timeout=60000)  # adjust this value to your network speed

        # Wait for the job list to load
        page.wait_for_selector('.positionlist__list-unlogin')
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, 'html.parser')
    job_items = soup.select('.joblist-box__item-unlogin')
    print(f'Found {len(job_items)} job listings')
    data = []

    for job in job_items:
        try:
            job_title = job.select_one('.jobinfo__name').get_text(strip=True)
            salary = job.select_one('.jobinfo__salary').get_text(strip=True)
            education = job.select('.jobinfo__other-info-item')[-1].get_text(strip=True)
            company = job.select_one('.companyinfo__name').get_text(strip=True)

            data.append({
                '职位名称': job_title,
                '薪酬待遇': salary,
                '学历要求': education,
                '公司名称': company
            })
        except Exception as e:
            print("Failed to parse a card:", e)

    return data



zhaopin_city_codes = {
    "北京": 530,
    "上海": 538,
    "广州": 763,
    "深圳": 765,
    "杭州": 653,
    "成都": 801,
    "武汉": 736,
    "南京": 635,
    "苏州": 639,
    "天津": 531,
    "重庆": 551,
    "西安": 854,
}


PAGES = 5  # how many pages to scrape per city

all_jobs = []

for city, city_code in zhaopin_city_codes.items():
    print(city, city_code)
    for page in range(1, PAGES+1):
        url = f"https://www.zhaopin.com/sou/jl{city_code}/kwCLO66RII0PJP0NG8/p{page}"
        job_data = scrape_job_info(url)
        all_jobs.extend(job_data)

# Write all job records to a CSV file
csv_file = 'job_list.csv'
fieldnames = ['职位名称', '薪酬待遇', '学历要求', '公司名称']

with open(csv_file, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_jobs)

print(f"Saved all {len(all_jobs)} job records to {csv_file}")

7. Lessons learned

  • AI assistance saves a huge amount of time.
  • Don't expect the AI to hand you a correct result on the first try; you still need basic programming knowledge and the ability to track down errors.
  • Try splitting the problem into several smaller ones and knock them out one by one.
  • Give the AI the HTML source of the page and let it write the extraction code; with enough sample markup it is remarkably accurate.
  • Understand what headless mode means.
  • Many sites nowadays load content dynamically, so some scrapers may never see the complete HTML. Selenium could probably be made to work too; I just didn't pursue it further.
  • The first time you use Playwright, read the installation steps carefully.
  • This could still be optimized with async calls, opening several pages at once to speed up scraping, but don't hit the site too often (a rough sketch follows below).
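To illustrate that last point, here is a rough, untested sketch using Playwright's async API with a semaphore to cap concurrency so the site is not hit too hard. It reuses the selectors from the final program above and simply assumes they still match the page.

import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

# Rough, untested sketch: fetch several city/page URLs concurrently,
# with at most 3 pages in flight at a time so the site is not hammered.
SEM = asyncio.Semaphore(3)

async def fetch_titles(browser, url):
    async with SEM:
        page = await browser.new_page()
        await page.goto(url, timeout=60000)
        await page.wait_for_selector('.positionlist__list-unlogin')
        html = await page.content()
        await page.close()
    # Same selectors as the final program above; extract only job titles here
    soup = BeautifulSoup(html, 'html.parser')
    return [item.select_one('.jobinfo__name').get_text(strip=True)
            for item in soup.select('.joblist-box__item-unlogin')
            if item.select_one('.jobinfo__name')]

async def main():
    # A small demo set: two cities, two pages each
    urls = [f"https://www.zhaopin.com/sou/jl{code}/kwCLO66RII0PJP0NG8/p{p}"
            for code in (530, 538) for p in range(1, 3)]
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        results = await asyncio.gather(*(fetch_titles(browser, u) for u in urls))
        await browser.close()
    print(sum(len(r) for r in results), "job titles collected")

asyncio.run(main())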