I was tasked with finding broken links and links that did not point back to a parent site. Here is the simple Bash HTTP spider that I wrote. followMaster.txt ends up with the list of URLs referring to the parent site; links_outside.txt and links_outsideNew.txt hold the external links, i.e. the ones that do not contain the URL given as the argument.
Usage: ./spider.sh http://url.url
#!/bin/bash
# This script is designed to spider a web site for URLs
# The first argument is the URL that will be spidered...
# This core URL will remain in scope as the spider goes through the site
# Will spider 5 rounds through the URLs found
if [ $# -eq 0 ]; then
    echo "Example: ./spider.sh url"
    echo "URL - URL to spider"
    echo ""
    exit
fi
wget "$1" -O main.txt
# Pull the in-scope links (those containing the base URL) out of the main page
grep "a href" main.txt | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v "javascript:" | sed 's/"//' | grep "$1" | sort | uniq > follow.txt
# Pull the external links (those not containing the base URL) out of the main page
grep "a href" main.txt | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v "javascript:" | sed 's/"//' | grep -v "$1" | sort | uniq > links_outside.txt
cp follow.txt followMaster.txt
rm -f followNew.txt
rm -f links_outsideNew.txt
touch followNew.txt
touch links_outsideNew.txt
for i in {1..5}
do
    while read -r line
    do
        wget "$line" -O child.txt
        grep "a href" child.txt | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v "javascript:" | sed 's/"//' | grep "$1" | sort | uniq >> followNew.txt
        grep "a href" child.txt | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v "javascript:" | sed 's/"//' | grep -v "$1" | sort | uniq >> links_outsideNew.txt
    done < "follow.txt"
    # Sort and de-duplicate the in-scope links found in the loop above
    sort followNew.txt | uniq > followNew.temp
    cat followNew.temp > followNew.txt
    # Sort and de-duplicate the external links found in the loop above
    sort links_outsideNew.txt | uniq > links_outsideNew.temp
    cat links_outsideNew.temp > links_outsideNew.txt
    # Keep only the links in followNew.txt that are not already in follow.txt
    comm -1 -3 follow.txt followNew.txt > followMaster.temp
    # Append them to the followMaster main file
    cat followMaster.temp >> followMaster.txt
    # Recreate the followMaster file of the URLs found and scanned
    sort followMaster.txt | uniq > followMaster.temp
    cat followMaster.temp > followMaster.txt
    # Recreate the follow.txt file with the URLs not yet scanned for another round
    comm -1 -3 followMaster.txt followNew.txt > follow.txt
done
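The grep/sed/awk pipeline is the workhorse here: it isolates everything between the <a href=" and the closing quote, throws away javascript: pseudo-links, and then splits the results into in-scope and external URLs. To see it in isolation, here is a quick test against a single made-up anchor tag (the example.com link is just illustrative input):

echo '<li><a href="http://example.com/about">About</a></li>' | grep "a href" | sed 's/.*<a href="//' | sed 's/">.*//' | awk '{print $1}' | grep -v "javascript:" | sed 's/"//'
# Prints: http://example.com/about

Note that this only catches one link per line and assumes double-quoted href attributes, which is a fair trade-off for a quick audit script.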
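The two comm calls are doing set subtraction. With -1 suppressing the lines unique to the first file and -3 suppressing the lines common to both, only the lines unique to the second file are printed, which is how the script isolates newly discovered URLs each round. A quick sketch of that behavior on a pair of throwaway files (comm needs both inputs sorted, which the earlier sort | uniq passes guarantee):

printf 'http://site/a\nhttp://site/b\n' > old.txt
printf 'http://site/b\nhttp://site/c\n' > new.txt
comm -1 -3 old.txt new.txt
# Prints: http://site/c -- the only URL in new.txt that is not in old.txt
rm -f old.txt new.txt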
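The spider only harvests URLs; actually flagging the broken ones takes one more pass over the results. Here is a minimal sketch of that pass, assuming curl is available and that followMaster.txt holds one URL per line; the broken_links.txt output file is my own naming, not something the spider produces:

#!/bin/bash
# Request each harvested URL and record any that do not come back with HTTP 200
while read -r url
do
    # -s silences progress output, -o /dev/null discards the body,
    # -w '%{http_code}' prints only the response status code
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" != "200" ]; then
        echo "$code $url" >> broken_links.txt
    fi
done < followMaster.txt

Redirects will show up as 3xx codes here; add -L if you would rather follow them and test the final destination.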