Simple Bash HTTP Spider

I was tasked with finding broken links and links that did not point back to a parent site. Here is the simple Bash HTTP spider that I wrote. followMaster.txt holds the list of URLs that refer to the parent site; links_outside.txt and links_outsideNew.txt hold the external links, i.e. those that do not contain the URL passed as the argument.

Usage: ./spider.sh http://url.url

#!/bin/bash
# This script is designed to spider a web site for URLs
# The first argument is the URL that will be spidered...
# This core URL will remain as the spider goes through the site
# Will spider 5 rounds through the URLs found

if [ $# -eq 0 ]; then
    echo "Example: ./spider.sh url"
    echo "URL - URL to spider"
    echo ""
    exit
fi

wget "$1" -O main.txt

cat main.txt | grep "a href" | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v -e "javascript:" | sed 's/"//' | grep "$1" | sort | uniq > follow.txt
cat main.txt | grep "a href" | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v -e "javascript:" | sed 's/"//' | grep -v "$1" | sort | uniq > links_outside.txt

cp follow.txt followMaster.txt

rm -f followNew.txt
rm -f links_outsideNew.txt
touch followNew.txt
touch links_outsideNew.txt

for i in {1..5}
do

    while read -r line
    do
        wget "$line" -O child.txt

        cat child.txt | grep "a href" | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v -e "javascript:" | sed 's/"//' | grep "$1" | sort | uniq >> followNew.txt
        cat child.txt | grep "a href" | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v -e "javascript:" | sed 's/"//' | grep -v "$1" | sort | uniq >> links_outsideNew.txt

    done < "follow.txt"

    # Sort and find the unique links from the loops above for links related to the company
    cat followNew.txt | sort | uniq > followNew.temp
    cat followNew.temp > followNew.txt

    # Sort and find the unique links from the loops above for the links not related to the company
    cat links_outsideNew.txt | sort | uniq > links_outsideNew.temp
    cat links_outsideNew.temp > links_outsideNew.txt

    # Compare the links in follow and followNew and add to the followMaster file
    comm -13 follow.txt followNew.txt > followMaster.temp

    # Append to the followMaster main file
    cat followMaster.temp >> followMaster.txt

    # Recreate a followMaster file of the URLs found and scanned
    cat followMaster.txt | sort | uniq > followMaster.temp
    cat followMaster.temp > followMaster.txt

    # Recreate the follow.txt file for another round if specified in the for loop
    comm -13 followMaster.txt followNew.txt > follow.txt

done
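The grep/sed link-extraction chain in the script can be exercised offline against a small HTML sample. This is a sketch of the same idea (condensed here with grep -o, which is an assumption, not the exact chain above): pull out href values, drop javascript: pseudo-links, and de-duplicate.

```shell
#!/bin/bash
# Offline demo of the link-extraction step: extract href values from HTML,
# filter out javascript: pseudo-links, and de-duplicate with sort -u.
extract_links() {
    grep -o '<a href="[^"]*"' \
        | sed 's/^<a href="//; s/"$//' \
        | grep -v '^javascript:' \
        | sort -u
}

# Small HTML sample, including a javascript: link and a duplicate.
sample='<p><a href="http://example.com/a">A</a>
<a href="javascript:void(0)">skip</a>
<a href="http://example.com/b">B</a>
<a href="http://example.com/a">A again</a></p>'

printf '%s\n' "$sample" | extract_links
# Prints:
# http://example.com/a
# http://example.com/b
```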

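The script collects links but never checks whether they resolve, even though the original task included finding broken links. A possible follow-up pass, a sketch only (the check_links name and the use of links_outside.txt as input are my assumptions), uses wget --spider, which requests the resource without downloading the body, and reports any URL that wget cannot retrieve:

```shell
#!/bin/bash
# Sketch of a broken-link pass over a file of URLs (one per line).
# wget --spider checks the URL without downloading the body;
# a non-zero exit status is treated as a broken or unreachable link.
check_links() {
    while read -r url; do
        if ! wget --spider -q "$url"; then
            echo "BROKEN: $url"
        fi
    done < "$1"
}

# Example: check the external links gathered by the spider.
# check_links links_outside.txt
```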