Skip to content

github-scraper used to scan repos owned by an org, clone them locally, look for a Dockerfile, and extract the FROM into a nice CSV for management

License

Notifications You must be signed in to change notification settings

psenger/github-scraper

Repository files navigation

github-scraper

Purpose

github-scraper is used to scan repos owned by an org, clone them locally, look for a Dockerfile, extract the FROM (build) value into a nice CSV for management to use in its reports, or to find a container that is running at the wrong version without asking the Dev Ops guys to do it.

Script Purpose
scraper.js Pulls all the repo data belonging to the org ( as defined by type ) and stores the data in a file ./data/<GITHUB-OUTFILE>. This file drives everything else.
build-masterlist.js This just reads ./data/<GITHUB-OUTFILE> and builds a CSV file ./data/<GITHUB-CSVFILE>
build-inventory.js Removes the directory ./out/ which will be the clone directory, once cloned, scans all files for a Dockerfile, reads them, and extracts ^FROM\s+(.*)\s*$ to a report called ./data/<GITHUB-INVENTORY>

Running

Required

  • A good internetnet connection
  • Node 15

Steps

  1. from the command prompt run npm install
  2. create a .env file with the environment variables listed in Variables
  3. from the command prompt run npm run build-masterlist
  4. from the command prompt run npm run scraper
  5. from the command prompt run npm run build-inventory
  6. send your report to your boss, and then drink some coffee or reach out to me Philip A Senger philip.a.senger@cngrgroup.com for a job.

Additional Docs

Refer to OctoKit for the Git hub api.

Refer to dotenv for a better understanding of .env files

Refer to Github Guides for Github

Refer to Docker Docs for Docker

Variables

This project uses .env

Variable Required Default Purpose
GITHUB-PAL-TOKEN true Personal access token (create)
GITHUB-TIMEZONE true The time zone (list)
GITHUB-ORG true The org to scan in the repos
GITHUB-TYPE true Specifies the types of repositories you want returned. Can be one of all, public, private, forks, sources, member, internal. Default: all. If your organization is associated with an enterprise account using GitHub Enterprise Cloud or GitHub Enterprise Server 2.20+, type can also be internal.
GITHUB-CSVFILE false ./data/data.csv Builds a CSV master list file ( when build-masterlist is executed )
GITHUB-OUTFILE false ./data/data.json Output from the scraper command, a full listing from github.
GITHUB-INVENTORY false ./data/inventory.csv the results of scanning files in github ( in this repo it is the Dockerfile FROM command )
GITHUB-SKIP-NAMES false '' any repos you want to skip while building the inventory.

Todo

  • The environment variables and expected chaining of data files is problematic.
  • Might be nice to scan for repos owned by owners and or orgs.
  • I think extracting the shell commands would be good, so you can make the code more reusable
  • Naming convention is not so good.
  • linting and tests would be good.
  • update build-masterlist to use the csv module and extract fields to environment variables.
  • change GITHUB-ORG so it is defaulted to all

About

github-scraper used to scan repos owned by an org, clone them locally, look for a Dockerfile, and extract the FROM into a nice CSV for management

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published