PAST2212/domainthreat

domainthreat

Daily Domain Monitoring for Brands and Mailing Domain Names

Current Version 3.13

Here you can find a Domain Monitoring tool. You can monitor your company brands (e.g. "amazon"), your mailing domains (e.g. "companygroup") or other words.

Motivation

Typical Domain Monitoring relies on brand names as input. This is not always sufficient to detect phishing attacks when the brand names and the mailing domain names differ.

Thought experiment: Suppose the example company "IBM" monitors its brand "IBM" and sends mail from @ibmgroup.com, and an attacker registers the domain ibrngroup.com (rn = m) for spear phishing purposes (e.g. CEO Fraud). Typical Brand (Protection) Domain Monitoring Solutions may struggle here, because the distance between the monitored brand name "IBM" and the registered domain name "ibrngroup.com" is too large to classify the domain as a true positive. This makes it harder for the targeted company to take appropriate measures proactively. The scenario can be avoided by also monitoring your mailing domain names, thus focusing on text strings rather than only on brands.
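The distance argument can be made concrete with a plain Levenshtein edit distance. This is only an illustrative sketch: the function name `levenshtein` is hypothetical, and the metric is not claimed to be the one domainthreat uses internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


registered = "ibrngroup"  # label of the look-alike domain, without the TLD

print(levenshtein("ibm", registered))       # brand name vs. look-alike: 7
print(levenshtein("ibmgroup", registered))  # mailing domain vs. look-alike: 2
```

Seven edits against a three-letter brand is far outside any sensible similarity threshold, while two edits against the mailing domain name is a clear near match.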

This was the motivation for this project.

Detection Scope

  • full-word matching (e.g. amazon-shop.com),
  • regular typo squatting cases (e.g. ammazon.com),
  • typical look-alikes / phishing / so-called CEO-Fraud domains (e.g. arnazon.com (rn = m)),
  • IDN Detection / look-alike Domains based on full word matching (e.g. 𝗉ay𝞀al.com - Greek letter RHO '𝞀' instead of Latin letter 'p'),
  • IDN Detection / look-alike Domains based on partial word matching (e.g. 𝗉ya𝞀a1.com - Greek letter RHO '𝞀' instead of Latin letter 'p' AND "ya" instead of "ay" AND number "1" instead of letter "l")
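One way to picture the look-alike cases above is to reduce both the keyword and the domain label to a common "skeleton" before comparing them. The sketch below is an assumption-laden illustration, not the tool's actual algorithm: the tables `SINGLE_CHAR` and `MULTI_CHAR` and the functions `skeleton` and `looks_like` are hypothetical, and the confusable table is deliberately tiny.

```python
import unicodedata

# Tiny illustrative confusable table; a real one would be far larger.
SINGLE_CHAR = {
    "\u03c1": "p",  # Greek small letter rho
    "\u0430": "a",  # Cyrillic small letter a
    "\u043e": "o",  # Cyrillic small letter o
    "1": "l",
    "0": "o",
}
MULTI_CHAR = {"rn": "m", "vv": "w"}


def skeleton(label: str) -> str:
    """Reduce a domain label to a canonical form for look-alike comparison."""
    # NFKC folds compatibility characters (e.g. mathematical letters) to base letters.
    text = unicodedata.normalize("NFKC", label).lower()
    text = "".join(SINGLE_CHAR.get(ch, ch) for ch in text)
    for sequence, replacement in MULTI_CHAR.items():
        text = text.replace(sequence, replacement)
    return text


def looks_like(label: str, keyword: str) -> bool:
    """True if the keyword appears in the label once both are reduced to skeletons."""
    return skeleton(keyword) in skeleton(label)


print(looks_like("arnazon", "amazon"))      # rn = m
print(looks_like("pay\u03c1al", "paypal"))  # Greek rho instead of Latin p
print(looks_like("amazon-shop", "amazon"))  # plain full-word match
```

Transposed letters such as "ya" for "ay" are not covered by this kind of normalization; they need an additional fuzzy comparison.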

Example Screenshot: Illustration of detected topic keyword 'tech' in source code of newly registered domain 'microsoftintegration[.]com' and detected subdomains

Features

Key Features & CSV Output Columns

  • Unicode domain names (IDN) / Homoglyph / Homograph Detection

  • Variety of domain fuzzing / similarity algorithms

  • Automated Website Translations

  • Support of a variety of different languages

  • Detected By: Full Keyword Match or Similar/Fuzzy Keyword Match

  • Source Code Match: Keyword detection in websites, even if they are in other languages (e.g. Chinese), by using different translators (normalized to English by default)
    ==> This covers the needs of international companies and foreign-language markets

  • Website Status: Check website status via HTTP status codes; an HTTPError indicates a 4XX client error or 5XX server error response code

  • Parked: Check if domain is parked for 2XX or 3XX Status Code domains (experimental state)

  • Subdomains: Subdomain Scan

  • E-Mail Availability: Check if domain is ready for receiving mails and/or ready for sending mails

  • Daily CSV export into a calendar-week-based CSV file (can be filtered by dates)
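The calendar-week export can be pictured as appending each day's results to one file per ISO week. The sketch below assumes a made-up file naming scheme and made-up column names; the real output file names and columns of domainthreat are not reproduced here.

```python
import csv
import datetime
from pathlib import Path


def weekly_csv_path(directory: Path, day: datetime.date) -> Path:
    """Build a per-calendar-week file name (hypothetical naming scheme)."""
    iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    return directory / f"example_results_week_{iso[1]:02d}_{iso[0]}.csv"


def append_result(directory: Path, day: datetime.date, domain: str, keyword: str) -> Path:
    """Append one result row; the header is written once per weekly file."""
    path = weekly_csv_path(directory, day)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["domain", "keyword", "date"])  # hypothetical columns
        writer.writerow([domain, keyword, day.isoformat()])
    return path
```

Because every row carries its date, a weekly file built this way can still be filtered down to a single day.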

Other Features

  • Multithreading (CPU core based) & Multiprocessing & Async Requests
  • False Positive Reduction Instruments (e.g. self-defined Blacklists, Thresholds depending on string length)
  • Keyword detection in websites whose domain names neither contain a brand nor resemble one
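The two false positive reduction instruments, blacklists and length-dependent thresholds, could look roughly like the following. This is a sketch under stated assumptions: the threshold values are invented for illustration, `difflib.SequenceMatcher` stands in for whatever similarity algorithms the tool actually uses, and the names `threshold_for` and `is_hit` are hypothetical.

```python
from difflib import SequenceMatcher

# Blacklist examples taken from this README.
BLACKLIST = {"lotto", "amazonas", "intuitive"}


def threshold_for(keyword: str) -> float:
    """Shorter keywords get a stricter threshold (values are made up)."""
    if len(keyword) <= 4:
        return 0.95
    if len(keyword) <= 8:
        return 0.85
    return 0.80


def is_hit(label: str, keyword: str) -> bool:
    """Decide whether a domain label should be reported for a keyword."""
    # Drop known word collisions first.
    if any(word in label for word in BLACKLIST):
        return False
    if keyword in label:
        return True  # full-word match
    ratio = SequenceMatcher(None, keyword, label).ratio()
    return ratio >= threshold_for(keyword)


print(is_hit("lotto24", "otto"))    # blacklisted collision
print(is_hit("ammazon", "amazon"))  # typo squatting candidate
```

Short keywords collide with ordinary words far more often, which is why a stricter threshold for them trades a little recall for much better precision.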

Principles

1. Basic Domain Monitoring

1.1. Keywords from file "domainthreat/data/userdata/keywords.txt" (e.g. tuigroup) are used for full-word detection (e.g. newtuigroup.shop) and similar-word detection (e.g. tuiqroup.com (g=q)) on newly registered domain names.

1.2. Keywords from file "domainthreat/data/userdata/topic_keywords.txt" are used to find these keywords (e.g. travel) in source code of (translated) webpages (e.g. dulichtui.com) of domain monitoring results from point 1.1.
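Step 1.2 can be sketched as extracting the title, description and keywords tags from a page and searching them for topic keywords. The example assumes the text is already in English (the translation step is left out), and the class `MetaTextParser` and function `find_topic_keywords` are hypothetical names, not part of domainthreat.

```python
from html.parser import HTMLParser
from typing import List


class MetaTextParser(HTMLParser):
    """Collect the <title> text and the description/keywords meta tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() in {"description", "keywords"}:
            self.parts.append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def find_topic_keywords(html: str, topic_keywords: List[str]) -> List[str]:
    """Return the topic keywords that occur in the collected page text."""
    parser = MetaTextParser()
    parser.feed(html)
    text = " ".join(parser.parts).lower()
    return [kw for kw in topic_keywords if kw.lower() in text]
```

For a page whose title is "Cheap Travel Deals", calling `find_topic_keywords(page, ["travel", "sneaker"])` would report only "travel".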

==> Results are exported to Newly_Registered_Domains_Calender_Week_ .csv File into Project Root Directory

2. Advanced Domain Monitoring

2.1. Keywords from file "domainthreat/data/userdata/topic_keywords.txt" (e.g. holiday) are used to make full-word detection (e.g. usa-holiday.net) on newly registered domain names.

2.2. Keywords from file "domainthreat/data/userdata/topic_keywords.txt" (e.g. holiday) are automatically translated into the languages that the user provides in the "domainthreat/data/userdata/languages_advanced_monitoring.txt" file. See "supported_languages.txt" for the currently supported languages. To enable translation, copy the desired languages from "supported_languages.txt" into the "domainthreat/data/userdata/languages_advanced_monitoring.txt" file (empty by default). Punycode domains are not supported by these translations at the moment.

==> Results from 2.1. will be enhanced by translated keywords from the "domainthreat/data/userdata/topic_keywords.txt" file. For example, "urlaub" is the German word for "holiday". The program will then additionally find German registered domains like "sea-urlaub.com"

2.3. Keywords from file "domainthreat/data/userdata/unique_brand_names.txt" are used to find these keywords (e.g. tui) in webpages of monitoring results from point 2.1. (e.g. usa-holiday.net) and from 2.2. (e.g. sea-urlaub.com) (if any supported languages are provided)
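Steps 2.1 and 2.2 can be sketched as expanding the topic keywords with their translations and then running full-word detection on the domain labels. The hard-coded `TRANSLATIONS` table is a stand-in for the automated translation, and the function names `expand_keywords` and `match_domains` are hypothetical.

```python
# Hypothetical, hard-coded translations; the tool obtains these via automated translation.
TRANSLATIONS = {"holiday": ["urlaub"]}


def expand_keywords(topic_keywords):
    """Step 2.2: add translated variants to the original topic keywords."""
    expanded = []
    for keyword in topic_keywords:
        expanded.append(keyword)
        expanded.extend(TRANSLATIONS.get(keyword, []))
    return expanded


def match_domains(domains, keywords):
    """Steps 2.1/2.2: full-word detection on newly registered domain names."""
    hits = {}
    for domain in domains:
        label = domain.lower().rsplit(".", 1)[0]  # drop the TLD
        found = [kw for kw in keywords if kw in label]
        if found:
            hits[domain] = found
    return hits


keywords = expand_keywords(["holiday"])
print(match_domains(["usa-holiday.net", "sea-urlaub.com", "example.org"], keywords))
```

The matched domains are then the candidates whose webpages are searched for the unique brand names in step 2.3.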

==> Results are exported to Advanced_Monitoring_Results_Calender_Week_ .csv File into Project Root Directory

Instructions

How to install:

How to run:

--similarity : Selection of similarity mode of homograph, typosquatting detection algorithms with options "close" OR "wide" OR "medium".

  • close: Fewer false positives and (potentially) more false negatives (default)
  • wide: More false positives and (potentially) fewer false negatives
  • medium: Trade-off between the close and wide modes.

--threads : Number of Threads

  • Default: Number of Threads is based on CPU cores

Running program per default (CPU core based + close similarity mode as default mode):

  • "python3 domainthreat.py"

Running program in wide similarity mode with 50 threads:

  • "python3 domainthreat.py --similarity wide --threads 50"
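The two command-line options described above map naturally onto `argparse`. The following is a sketch of how such an interface could be declared, not the parser that domainthreat.py actually contains; in particular, using the raw CPU core count as the thread default is an assumption, since the README only says the default is "based on CPU cores".

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the --similarity and --threads options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="domainthreat.py")
    parser.add_argument(
        "--similarity",
        choices=["close", "medium", "wide"],
        default="close",
        help="similarity mode for homograph / typosquatting detection",
    )
    parser.add_argument(
        "--threads",
        type=int,
        default=os.cpu_count() or 1,  # assumed default: one thread per CPU core
        help="number of threads",
    )
    return parser


args = build_parser().parse_args(["--similarity", "wide", "--threads", "50"])
print(args.similarity, args.threads)
```

Restricting `--similarity` with `choices` makes a mistyped mode fail immediately instead of silently falling back to a default.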

How to update:

  • cd domainthreat

  • git pull

  • In case of a merge error: Try "git reset --hard" before "git pull"

    ==> Back up your userdata folder before updating

Before the first run - How it Works:

  1. Put your brand names or mailing domain names (without the TLD) into the TXT file "domainthreat/data/userdata/keywords.txt", one per line, for monitoring operations. Some "TUI" names are listed by default.

  2. Put common word collisions that you want to exclude from the results into the TXT file "domainthreat/data/userdata/blacklist_keywords.txt", one per line, to reduce false positives.

  • e.g. blacklist "lotto" if you monitor keyword "otto", e.g. blacklist "amazonas" if you want to monitor "amazon", e.g. blacklist "intuitive" if you want to monitor "tui" ...
  3. Put commonly used words that describe your brands, industry, brand names or products on websites into the TXT file "domainthreat/data/userdata/topic_keywords.txt", one per line. These keywords will be used for searching / matching in the source code of websites. English is the default, normalized language: the HTML Title, Description and Keywords tags are automatically translated into it via different translators.
  • e.g. Keyword "fashion" for a fashion company, e.g. "sneaker" for shoe company, e.g. "Zero Sugar" for Coca Cola Inc., e.g. "travel" for travel company...
  4. Put your brand names into the TXT file "domainthreat/data/userdata/unique_brand_names.txt", one per line, for monitoring operations (e.g. "tui"). These keywords will be used for searching / matching in the source code of websites whose domain names neither contain your brand names nor resemble them (e.g. usa-holiday.net). Some "TUI" names are listed by default.
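All four userdata files share the same one-entry-per-line layout, so a single loader is enough to read them. The sketch below is a hypothetical helper (`load_keywords` is not a function of domainthreat); lowercasing and de-duplicating the entries are assumptions made for illustration.

```python
from pathlib import Path


def load_keywords(path: Path):
    """Read one keyword per line, skipping blank lines and normalising case."""
    keywords = []
    for line in path.read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word and word not in keywords:
            keywords.append(word)
    return keywords
```

With the repository layout described above, a call such as `load_keywords(Path("domainthreat/data/userdata/keywords.txt"))` would return the monitored names as a list.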

Troubleshooting

  • In case of errors with modules "httpcore" or "httpx" - possible fixes:
    • pip uninstall googletrans (if you previously installed an older version of domainthreat, i.e. version <= 2.11)
    • pip install --upgrade pip
    • pip install --upgrade httpx
    • pip install --upgrade httpcore

Changelog

Notes

Author

TO DO

  • Add additional fuzzy matching algorithms to increase true positive rate / accuracy (Sequence-based algorithm "Longest Common Substring" is already included but not activated by default)
  • Enhance source code keyword detection on subdomain level
  • AI based Logo Detection by Object Detection
  • PEP8 compliance

Additional

  • The public source for newly registered domains, whoisds (https://www.whoisds.com/newly-registered-domains), caps the number of daily registrations at 100,000. Other sources exist; use them instead if you prefer.
  • Thresholds for similarity modes (wide, medium, close) have been selected carefully. The "wide" mode can have a high false positive rate (and therefore lower precision) because it allows for more freedom in how domain name variations are registered, which reduces false negatives and therefore improves recall. Adjust the thresholds of the different modes if you want to match your needs better. I strongly recommend this article to get a better understanding of the recall-precision tradeoff: https://towardsdatascience.com/precision-vs-recall-evaluating-model-performance-in-credit-card-fraud-detection-bb24958b2723
  • A perfect supplement to this wonderful project: https://github.com/elceef/dnstwist
  • Written in Python 3.10
  • Recommended Python Version >= 3.8
  • Some TLDs are not included in this public source (e.g. ".de" domains). You can work around this with my other project https://github.com/PAST2212/certthreat, which uses Certificate Transparency logs as input instead.