
lizhuoqi/crawling


Documentation (Chinese version)

Introduction

This is a web crawling program implemented in Go. It uses ferret to define and execute FQL queries, go-co-op/gocron to schedule tasks, and viper to read configuration files, covering both program configuration and task configuration.

After comparing several scraping tools, I chose ferret for the web data crawling because queries are defined in the FQL DSL: scripts can be modified easily, and the program does not need to be recompiled after a modification.

In order to use ferret as a library rather than through its CLI, I studied the examples in the ferret source code and imitated them to embed ferret directly in the program.
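For reference, here is a minimal sketch of that embedding, modeled on ferret's published library example (the runFQL helper and the sample query are illustrative, not this program's actual code):

package main

import (
	"context"
	"fmt"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/drivers"
	"github.com/MontFerret/ferret/pkg/drivers/http"
)

// runFQL compiles an FQL query and executes it, returning the JSON result.
func runFQL(query string) ([]byte, error) {
	comp := compiler.New()
	program, err := comp.Compile(query)
	if err != nil {
		return nil, err
	}
	// Register the plain HTTP driver as the default driver for DOCUMENT().
	ctx := drivers.WithContext(context.Background(), http.NewDriver(), drivers.AsDefault())
	return program.Run(ctx)
}

func main() {
	out, err := runFQL(`RETURN "hello from ferret"`)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}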

In order to support multiple FQL scripts, each with its own schedule strategy, go-co-op/gocron is used as the scheduler in the program.
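The following sketch shows how gocron can express both schedule styles this program supports (the interval and the concurrency cap mirror the configuration examples below; the code itself is illustrative):

package main

import (
	"fmt"
	"time"

	"github.com/go-co-op/gocron"
)

func main() {
	s := gocron.NewScheduler(time.UTC)

	// Cap the number of jobs running at once, like scheduler.max below.
	s.SetMaxConcurrentJobs(4, gocron.WaitMode)

	// Interval-style schedule, like "every: 3m" in a job file.
	_, _ = s.Every("3m").Do(func() { fmt.Println("interval job") })

	// Cron-style schedule, like "cron: ..." in a job file (here: Fridays at 18:00).
	_, _ = s.Cron("0 18 * * 5").Do(func() { fmt.Println("cron job") })

	s.StartBlocking()
}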

Configuration instructions

Configuration is divided into three parts: program configuration, job configuration, and FQL scripts.

Program configuration

The program configuration file is configs/crawl.[yaml|json|toml]; the current version of the program does not allow changing this path. Because the program uses spf13/viper as its configuration library, any format supported by viper can be used.

The currently valid configuration items are:

jobsdir: configs/jobs # default configDir/jobs
fqldir: configs/fql   # default configDir/fql
outdir: output        # default: current working directory
scheduler:
  # maximum number of fql crawling jobs running concurrently
  max: 4

jobsdir : The directory the program scans for crawling job configuration files

fqldir : The root directory of the ferret scripts

outdir : The output location of ferret script execution results

scheduler.max : The maximum concurrency of gocron scheduling
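To illustrate how these keys are read, here is a minimal viper sketch (the defaults follow the comments above; the program's actual code may differ):

package main

import (
	"fmt"

	"github.com/spf13/viper"
)

func main() {
	v := viper.New()
	v.SetConfigName("crawl")   // matches configs/crawl.[yaml|json|toml]
	v.AddConfigPath("configs") // viper picks up whichever supported format it finds
	// Defaults as described above (assumed; the real code may differ).
	v.SetDefault("jobsdir", "configs/jobs")
	v.SetDefault("fqldir", "configs/fql")
	v.SetDefault("outdir", ".")
	if err := v.ReadInConfig(); err != nil {
		panic(err)
	}
	fmt.Println(v.GetString("jobsdir"), v.GetInt("scheduler.max"))
}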

Fetch job configuration

The jobsdir in configs/crawl.yaml points to the directory containing the job configuration files. The program scans all YAML job configuration files in this directory and its subdirectories.

The spec of a job configuration file is as follows:

enable: true
fqljobs:
  - name: zhihu/hotlist
    desc: Zhihu hot list
    script: zhihu/hotlist.fql
    output: zhihu_hotlist.json
    enable: true
    schedule:
      every: 3m
  - name: bilibili/weekly
    script: bilibili/weekly.fql
    desc: Bilibili weekly must-watch
    schedule:
      cron: "* 3 18/1 * * 5 *" # Every Friday 18:00-23:00

enable : Required, default false. This lets you disable/enable all the jobs in one YAML file at once.

job.name : Required. The job name; it must be unique within a single YAML configuration file

job.desc : Optional. Description to help you remember and understand the job

job.script : Required. The relative path of the ferret query script; the program resolves it against the directory specified by configs/crawl.yaml#fqldir

job.output : Optional. The relative path of the file the FQL result is saved to; these files are saved under the directory specified by configs/crawl.yaml#outdir. If it is missing, the program derives a default value in the format configs/crawl.yaml#fqldir + '_' + script + '.json', with path separators replaced by underscores (see the sketch after this list). For example, bilibili/weekly above does not specify output, so the default output file name automatically generated by the program is configs_fql_bilibili_weekly.json

job.enable : Optional, default false. Only jobs with enable=true are loaded

job.schedule : cron is read first; if it is missing, every is tried, and if both are absent the default is every 7m. Because this program uses the go-co-op/gocron scheduling library, cron accepts a cron expression, while every accepts an interval using the units s (seconds), m (minutes), or h (hours). If the interval is very long, such as many days or a month, a cron expression is recommended for more precise control
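The default output-name rule described under job.output can be sketched as follows (a hypothetical helper, not the program's actual code):

package main

import (
	"fmt"
	"path"
	"strings"
)

// defaultOutput derives the default output file name from fqldir and the script path.
func defaultOutput(fqldir, script string) string {
	p := path.Join(fqldir, script)         // configs/fql/bilibili/weekly.fql
	p = strings.TrimSuffix(p, path.Ext(p)) // configs/fql/bilibili/weekly
	return strings.ReplaceAll(p, "/", "_") + ".json"
}

func main() {
	fmt.Println(defaultOutput("configs/fql", "bilibili/weekly.fql"))
	// Output: configs_fql_bilibili_weekly.json
}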

A job configuration file can define multiple jobs at the same time, and the number of job configuration files is not limited, as long as they are in the directory (or a subdirectory) specified by configs/crawl.yaml#jobsdir. Plan according to your needs.

fql

Ferret query scripts are searched for in the configs/crawl.yaml#fqldir directory and its subdirectories. They are written in the Ferret Query Language (FQL) and can be developed and tested in advance at montferret.dev/try/.
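As a flavor of the language, here is a small illustrative FQL script (the URL and CSS selector are examples only, not part of this project):

LET doc = DOCUMENT('https://example.com/')

FOR el IN ELEMENTS(doc, 'a')
    RETURN {
        text: el.innerText,
        href: el.attributes.href
    }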

Running the program

Make sure that the program has been compiled, or download a prebuilt executable.

How to install or compile

  • Download

github.com/lizhuoqi/crawling/releases
gitee.com/lizhuoqi/crawling/releases

  • Compile by yourself
> git clone https://github.com/lizhuoqi/crawling.git # or the gitee mirror
> go mod tidy -v
> go build -v
> # Or use make
> make build

Run

> ./crawl

While the program is running, a newly generated JSON file overwrites the existing one; history is not kept. If you need to pick up updates promptly, set up a file watcher to trigger follow-up actions.
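One possible watcher, sketched with the third-party fsnotify library (an assumption; the project only suggests "a file watcher", so any equivalent tool works):

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the crawler's output directory (outdir in configs/crawl.yaml).
	if err := watcher.Add("output"); err != nil {
		log.Fatal(err)
	}

	for event := range watcher.Events {
		if event.Op&fsnotify.Write == fsnotify.Write {
			log.Println("updated:", event.Name)
			// Trigger follow-up actions here, e.g. archive or publish the new JSON.
		}
	}
}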