[Docker in Action] Python on Docker: Scraping Douyin Web Data (19)

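Hands-on Douyin scraping: why scrape this data at all? Suppose an online fresh-grocery e-commerce company wants to advertise on high-traffic platforms to raise its product exposure. While choosing a channel, the owner notices Douyin, which carries a huge amount of traffic, and wants to try placing ads there to see whether the return justifies the cost. To decide, the company analyzes Douyin data and user profiles to judge how well the audience matches its own customers, which requires each account's follower count, like count, following count, and nickname. Knowing what users like lets the company work its products into videos and promote them more effectively. PR agencies can use the same data to spot up-and-coming influencers and package them for marketing. Source code: https://github.com/limingios/dockerpython.git (douyin)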

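The Douyin share page

  • Introduction

  • Install the Chrome XPath Helper extension

Get the crx file from the source code

In Chrome, enter: chrome://extensions/

Drag xpath-helper.crx directly onto the chrome://extensions/ page

After installation succeeds

The shortcut Ctrl+Shift+X launches XPath Helper; it is usually used together with Chrome's F12 developer tools.

Start scraping the Douyin share page with Python

Analyze the share page https://www.douyin.com/share/user/76055758243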

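1. Douyin has an anti-scraping mechanism: the digits in the Douyin ID are replaced with obfuscated strings, which have to be mapped back to digits.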

{'name':['  ','  ','  '],'value':0},
        {'name':['  ','  ','  '],'value':1},
        {'name':['  ','  ','  '],'value':2},
        {'name':['  ','  ','  '],'value':3},
        {'name':['  ','  ','  '],'value':4},
        {'name':['  ','  ','  '],'value':5},
        {'name':['  ','  ','  '],'value':6},
        {'name':['  ','  ','  '],'value':7},
        {'name':['  ','  ','  '],'value':8},
        {'name':['  ','  ','  '],'value':9},
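The substitution step can be sketched as follows. The tokens in `GLYPH_TO_DIGIT` are made-up placeholders, not Douyin's real obfuscated strings (those belong in the `name` lists above), and `decode_digits` is a hypothetical helper name used only for illustration.

```python
import re

# Hypothetical placeholder tokens; the real page uses its own obfuscated strings.
GLYPH_TO_DIGIT = [
    {'name': ['@A0@', '@B0@'], 'value': 0},
    {'name': ['@A1@', '@B1@'], 'value': 1},
    {'name': ['@A2@', '@B2@'], 'value': 2},
]

def decode_digits(html, table=GLYPH_TO_DIGIT):
    """Replace every obfuscated token in html with the digit it stands for."""
    for entry in table:
        for token in entry['name']:
            # re.escape keeps tokens that contain regex metacharacters literal
            html = re.sub(re.escape(token), str(entry['value']), html)
    return html

decoded = decode_digits('<i>@A1@</i><i>@B2@</i><i>@A0@</i>')
print(decoded)
```

Several tokens can map to the same digit, which is why each entry carries a list of names rather than a single string.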

2. Get the XPath of each node we need

# Nickname
//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()

# Douyin ID
//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()

# Job
//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()

# Description
//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()

# Location
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()

# Zodiac sign
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()

# Following count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()

# Follower count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()

# Like count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()
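To see how class-based paths like these pick out fields, here is a minimal sketch run against a tiny, made-up HTML fragment. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so it needs no third-party package; ElementTree supports only a subset of XPath (no `text()` step, for example), so the element's `.text` is read instead. `SAMPLE`, `text_of`, and `extract_profile` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the share page markup.
SAMPLE = """
<html><body>
<div class="personal-card">
  <div class="info1"><p class="nickname">demo_user</p></div>
  <div class="info2">
    <p class="signature">hello world</p>
    <p class="extra-info"><span>Beijing</span><span>Leo</span></p>
  </div>
</div>
</body></html>
"""

def text_of(node):
    """Return the stripped text of an element, or None when it is missing."""
    return node.text.strip() if node is not None and node.text else None

def extract_profile(markup):
    root = ET.fromstring(markup)
    card = root.find(".//div[@class='personal-card']")
    spans = card.findall(".//p[@class='extra-info']/span")
    return {
        'nick_name': text_of(card.find(".//p[@class='nickname']")),
        'describe': text_of(card.find(".//p[@class='signature']")),
        # span[1] and span[2] in XPath are the first and second list items here
        'location': text_of(spans[0]) if len(spans) > 0 else None,
        'xingzuo': text_of(spans[1]) if len(spans) > 1 else None,
    }

print(extract_profile(SAMPLE))
```

Returning None for a missing node avoids the IndexError that a bare `[0]` on an empty XPath result would raise.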

  • Complete code
import re
import requests
import time
from lxml import etree



def handle_decode(input_data,share_web_url,task):
    search_douyin_str = re.compile(r'抖音ID:')
    regex_list = [
        {'name':['  ','  ','  '],'value':0},
        {'name':['  ','  ','  '],'value':1},
        {'name':['  ','  ','  '],'value':2},
        {'name':['  ','  ','  '],'value':3},
        {'name':['  ','  ','  '],'value':4},
        {'name':['  ','  ','  '],'value':5},
        {'name':['  ','  ','  '],'value':6},
        {'name':['  ','  ','  '],'value':7},
        {'name':['  ','  ','  '],'value':8},
        {'name':['  ','  ','  '],'value':9},
    ]

    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2,str(i1['value']),input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    # the shortid paragraph holds the "抖音ID:" label; strip the label and append the decoded digits
    douyin_info['douyin_id'] = re.sub(search_douyin_str,'',share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/text()")[0]).strip() + douyin_id


    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        # the account has no verification info, so 'job' stays unset
        pass
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n',',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
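    # unit 'w' means ten thousand; the digits arrive without the decimal point
    if unit and unit[-1].strip() == 'w':
        douyin_info['fans'] = str(int(fans_value)/10)+'w'
    else:
        douyin_info['fans'] = fans_value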
    like = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
    if unit and unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like)/10)+'w'
    else:
        douyin_info['like'] = like
    douyin_info['from_url'] = share_web_url


    print(douyin_info)

def handle_douyin_web_share(share_id):
    share_web_url = 'https://www.douyin.com/share/user/'+share_id
    print(share_web_url)
    share_web_header = {
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url,headers=share_web_header)
    handle_decode(share_web_response.text,share_web_url,share_id)

if __name__ == '__main__':
    while True:
        # hard-coded task for testing; with a fixed ID this loop never exits on its own
        share_id = "76055758243"
        if share_id is None:
            print('No task left, exiting')
            break
        else:
            print('Currently processing task: %s' % share_id)
            handle_douyin_web_share(share_id)
        time.sleep(2)
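The follower and like counts above are assembled from a digit string plus a unit. The sketch below isolates that arithmetic. It assumes, based on the `int(...)/10` step in the code above, that the page delivers the digits without a decimal point when the unit is `w` (ten thousand). `format_count` and `to_int` are hypothetical helper names.

```python
def format_count(digits, unit):
    """Mirror the arithmetic above: with unit 'w' the last digit is a decimal place."""
    digits = digits.strip()
    if unit.strip() == 'w':
        return str(int(digits) / 10) + 'w'
    return digits

def to_int(count):
    """Convert a display value such as '12.3w' or '456' to an integer (1w = 10,000)."""
    if count.endswith('w'):
        return int(round(float(count[:-1]) * 10000))
    return int(count)

shown = format_count('123', 'w')
print(shown, to_int(shown))
```

Storing the integer form alongside the display string makes later sorting and filtering by follower count straightforward.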

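MongoDB

Use Vagrant to create a virtual machine and set up MongoDB on it. For details, see
[Docker in Action] Python Docker crawler techniques: scraping an app with Python scripts (13)

su -
# password: vagrant
docker

>https://hub.docker.com/r/bitnami/mongodb
>Default port: 27017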
``` bash
docker pull bitnami/mongodb:latest
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest

# stop the firewall
systemctl stop firewalld
```

  • Working with MongoDB

Read the userId values from a txt file.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/30 19:35
# @Author  : Aries
# @Site    : 
# @File    : handle_mongo.py
# @Software: PyCharm

import pymongo
from pymongo.collection import Collection


client = pymongo.MongoClient(host='192.168.66.100',port=27017)
db = client['douyin']

def handle_init_task():
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt','r') as f:
        for line in f:
            share_id = line.strip()
            if not share_id:
                continue  # skip blank lines
            # insert() was removed in PyMongo 4; insert_one() is the supported call
            task_id_collections.insert_one({'share_id': share_id})



def handle_get_task():
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})

#handle_init_task()
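The modified script imports `handle_insert_douyin` from this module, but its definition is not shown here. The following is a minimal sketch of what such a helper and the surrounding task loop might look like, not the author's implementation. The signature of `handle_insert_douyin`, the `drain_tasks` function, and the `InMemoryCollection` stand-in are all assumptions; with a real database you would pass PyMongo `Collection` objects, which provide `insert_one` and `find_one_and_delete`.

```python
class InMemoryCollection:
    """Tiny stand-in exposing the two PyMongo Collection methods used below."""
    def __init__(self, docs=None):
        self.docs = list(docs or [])

    def find_one_and_delete(self, query):
        # With an empty filter: remove and return one document, or None when empty.
        return self.docs.pop(0) if self.docs else None

    def insert_one(self, doc):
        self.docs.append(dict(doc))

def handle_insert_douyin(douyin_info, collection):
    """Hypothetical helper: store one scraped profile document."""
    collection.insert_one(douyin_info)

def drain_tasks(tasks, results, scrape):
    """Take tasks one at a time until the queue is empty, storing each result."""
    handled = 0
    while True:
        task = tasks.find_one_and_delete({})
        if task is None:
            break  # queue is empty, so the loop ends instead of spinning forever
        handle_insert_douyin(scrape(task['share_id']), results)
        handled += 1
    return handled

tasks = InMemoryCollection([{'share_id': '111'}, {'share_id': '222'}])
results = InMemoryCollection()
# the lambda is a fake scraper that only builds the source URL
count = drain_tasks(tasks, results,
                    lambda sid: {'from_url': 'https://www.douyin.com/share/user/' + sid})
print(count, results.docs)
```

Because `find_one_and_delete` removes the task as it hands it out, several workers can share one queue without processing the same ID twice.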

  • Modify the Python program to call it
import re
import requests
import time
from lxml import etree
from handle_mongo import handle_get_task
from handle_mongo import handle_insert_douyin


def handle_decode(input_data,share_web_url,task):
    search_douyin_str = re.compile(r'抖音ID:')
    regex_list = [
        {'name':['  ','  ','  '],'value':0},
        {'name':['  ','  ','  '],'value':1},
        {'name':['  ','  ','  '],'value':2},
        {'name':['  ','  ','  '],'value':3},
        {'name':['  ','  ','  '],'value':4},
        {'name':['  ','  ','  '],'value':5},
        {'name':['  ','  ','  '],'value':6},
        {'name':['  ','  ','  '],'value':7},
        {'name':['  ','  ','  '],'value':8},
        {'name':['  ','  ','  '],'value':9},
    ]

    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2,str(i1['value']),input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
    douyin_id = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
    # the shortid paragraph holds the "抖音ID:" label; strip the label and append the decoded digits
    douyin_info['douyin_id'] = re.sub(search_douyin_str,'',share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/text()")[0]).strip() + douyin_id


    try:
        douyin_info['job'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        # the account has no verification info, so 'job' stays unset
        pass
    douyin_info['describe'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n',',')
    douyin_info['location'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[0].strip()
    fans_value = ''.join(share_web_html.xpath("//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
    unit = share_web_html.xpath("//div[    
