Idle chatter
Recently my company asked me to check whether its homepage had been injected with hidden (black-hat SEO) links. There were few ready-made scripts to be found online, so I wrote one to extract a site's links. As the requirements grew and a few ideas came together, it eventually turned into a URL collector.
Preface
URL collection is a common task: it helps us quickly gather URLs that match some requirement. Most URL collection software on the market works the same way: it queries multiple search engine interfaces with a keyword (for example, to collect recruitment sites you would search for terms like "job hunting" or "recruitment"), scrapes as many URLs as possible from each interface, filters them against a customizable blacklist, and finally deduplicates the results.
In other words, you need as many engine interfaces as possible (Google, Baidu, and so on), pass the query parameters, extract URLs from the returned pages while applying the blacklist, and then iterate through the result pages.
This seems reasonable: enter a keyword, get relevant URLs. But it hides several shortcomings:
1. Only URLs already indexed by the search engines can be collected, so many sites that actually meet the requirement are missed
2. The filtering is coarse: deduplication plus a blacklist cannot guarantee that the collected sites are really the ones you want
3. Everyone uses the same collectors with more or less the same keywords, so the results are also more or less the same. For security researchers this is unfriendly: a vulnerable site you find may well have been found and exploited by many others already
Features

To address these shortcomings, I planned a URL deep-collection script with the following preliminary features:
1. Two entry points: either a search engine interface (keyword search) or an imported text file of already-collected URLs
2. In keyword mode, collect the sites that match the rules and then automatically crawl their friend links as well
3. In import mode, first filter out the sites that do not match the rules, with friend-link crawling as a configurable option
4. User-configurable whitelists and blacklists for the URL itself, for the page title, and for the page content
The rough flow is as follows:
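In short: collect seed URLs (from a search engine or an imported file), filter them against the user's rules, optionally expand the survivors through their friend links, filter those again, and save the result. A minimal sketch of that flow, where matches_rules and collect_links stand in for the filter and friend-link crawler developed below:

def deep_collect(seeds, matches_rules, collect_links, crawl_friend_links=True):
    # Keep only the seed URLs that pass the user's black/white-list rules
    kept = {u for u in seeds if matches_rules(u)}
    if crawl_friend_links:
        # Expand every surviving site through its friend links, then filter those too
        friends = {f for u in kept for f in collect_links(u)}
        kept |= {f for f in friends if matches_rules(f)}
    return kept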
banner
title = '''
__ _______ __ __ .______ __
| | | ____| | | | | | _ \ | |
| | | |__ | | | | | |_) | | |
.--. | | | __| | | | | | / | |
| `--' | | | | `--' | | |\ \----.| `----.
\______/ |__| _____\______/ | _| `._____||_______|
|______|
Author: JF
Version: V1.0
'''
URL collection source code
Friend link collection
Method one: Regular expression filtering
def GetLink(url):
    UA = random.choice(headerss)
    headers = {'User-Agent': UA, 'Connection': 'close'}
    link_urls = []
    try:
        r = requests.get(url, headers=headers, verify=False, timeout=timeout)
        # Detect the encoding declared in the page itself, then decode the raw content
        encoding = requests.utils.get_encodings_from_content(r.text)[0]
        content = r.content.decode(encoding)
        # Extract every http/https link and reduce it to scheme://netloc
        urls = [f"{urlparse(url).scheme}://{urlparse(url).netloc}" for url in re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', content, re.I)]
        for url in list(set(urls)):
            url = url.replace('\\', '')
            link_urls.append(url)
            # Liveness check (optional)
            # try:
            #     r = requests.get(url, timeout=5, verify=False)
            #     if b'Service Unavailable' not in r.content and b'The requested URL was not found on' not in r.content and b'The server encountered an internal error or miscon' not in r.content:
            #         if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
            #             link_urls.append(url)
            # except Exception as error:
            #     pass
    except:
        pass
    return list(set(link_urls))
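This snippet (and the ones that follow) relies on a few module-level names defined elsewhere in the script: headerss, a list of User-Agent strings to rotate through, and timeout, read from the configuration file. A minimal, assumed setup to try it on its own could look like this:

import re
import random
import requests
import urllib3
from urllib.parse import urlparse

urllib3.disable_warnings()  # the script calls requests with verify=False throughout

# Assumed stand-ins for the script's real globals
headerss = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15',
]
timeout = 5

print(GetLink('https://example.com'))

The bs4 variant below additionally needs from bs4 import BeautifulSoup.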
Method two: bs4 filtering
def GetLink(url):
    UA = random.choice(headerss)
    headers = {'User-Agent': UA, 'Connection': 'close'}
    bs4_urls = set()
    try:
        r = requests.get(url, headers=headers, verify=False)
        encoding = requests.utils.get_encodings_from_content(r.text)[0]
        content = r.content.decode(encoding)
        # Use BeautifulSoup to parse the HTML content
        soup = BeautifulSoup(content, 'html.parser')
        # Extract the common tag attributes that may contain URLs
        for tag in ['a', 'img', 'script', 'link']:
            for attr in ['href', 'src']:
                for element in soup.find_all(tag):
                    if attr in element.attrs:
                        href = element.get(attr)
                        if href and (href.startswith('http://') or href.startswith('https://')):
                            parsed = urlparse(href)
                            url = f"{parsed.scheme}://{parsed.netloc}"
                            bs4_urls.add(url)
    except Exception as e:
        pass
    # Keep only the URLs that respond
    link_urls = []
    for bs4_url in bs4_urls:
        try:
            r = requests.head(bs4_url, timeout=5, headers=headers, verify=False)
            if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                link_urls.append(bs4_url)
        except Exception as error:
            pass
    return link_urls
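A note on the liveness check: requests.head only fetches the response headers, so it is lighter than the GET-based check commented out in method one. Since requests.head does not follow redirects by default, a site that redirects (for example from HTTP to HTTPS) answers with 301/302 rather than 200, which is why those status codes are also treated as alive here.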
Keyword collection
Query the Baidu search interface with the user's keyword and extract the site URLs from the first few result pages (eight in the code below):
def BDUrl(key):
    cookie = input('Please enter the cookie:')
    bd_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Referer": "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=2&ch=&tn=baiduhome_pg&bar=&wd=123&oq=123&rsv_pq=896f886f000184f4&rsv_t=fdd2CqgBgjaepxfhicpCfrqeWVSXu9DOQY5WyyWqQYmsKOC%2Fl286S248elzxl%2BJhOKe2&rqlang=cn",
        # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
        "Sec-Fetch-Mode": "navigate",
        "Cookie": cookie,
        "Connection": "Keep-Alive",
    }
    bd_url = []
    for page in range(0, 8):
        url = 'http://www.baidu.com/s?wd={}&pn={}0'
        try:
            r = requests.get(url.format(key, page), headers=bd_headers, verify=False)
            encoding = requests.utils.get_encodings_from_content(r.text)[0]
            content = r.content.decode(encoding)
            # Extract the target URLs from the mu="..." attributes in the result list
            result = [f"{urlparse(url).scheme}://{urlparse(url).netloc}" for url in re.findall('mu="(.*?)"', content)[1:]]
            # result = [url.split('//')[1].split('/')[0] for url in re.findall('mu="(.*?)"', content)[1:]]
            for res_url in list(set(result)):
                bd_url.append(res_url)
                # Liveness check (optional)
                # try:
                #     r = requests.get(res_url, timeout=5, verify=False)
                #     if b'Service Unavailable' not in r.content and b'The requested URL was not found on' not in r.content and b'The server encountered an internal error or miscon' not in r.content:
                #         if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                #             bd_url.append(res_url)
                # except Exception as error:
                #     pass
        except:
            pass
    return list(set(bd_url))
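For a quick standalone test (using the assumed setup sketched earlier), the keyword collector can be chained straight into the friend-link collector; the keyword below is only illustrative:

# Illustrative only: collect seeds for one keyword, then expand via friend links
seeds = BDUrl('招聘')   # prompts for a Baidu cookie, then walks the result pages
friend_links = set()
for site in seeds:
    friend_links.update(GetLink(site))
print(f'{len(seeds)} seed sites, {len(friend_links)} friend-link sites')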
ini configuration file
The core of the script: users describe the URLs they want by editing a custom configuration file.
[User]
# Name of the user running the program
whoami = JF

# Notes:
# state controls friend-link crawling: 0 = disabled, 1 = enabled
# None means "do not check this item"
# Or-logic is supported via the | symbol
# Priority: URL blacklist > URL whitelist > title blacklist > title whitelist > page-content blacklist > page-content whitelist
[Config]
# Friend-link crawling
state = 0
# URL blacklist
black_url = None
# URL whitelist
white_url = None
# Title blacklist
black_title = None
# Title whitelist
white_title = 安全狗
# Page-content blacklist
black_content = None
# Page-content whitelist
white_content = None
# Connection timeout in seconds
timeout = 5
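The post does not show how these values are loaded, but the later code compares them as strings (for example black_url == 'None' and state == '0') and uses timeout as a number, so a minimal, assumed loader with configparser might look like this:

# Assumed loader sketch - the post itself never shows this part
import configparser

conf = configparser.ConfigParser()
conf.read('config.ini', encoding='utf-8')

whoami = conf.get('User', 'whoami')
state = conf.get('Config', 'state')            # compared as '0' / '1' later
black_url = conf.get('Config', 'black_url')    # 'None' means the check is skipped
white_url = conf.get('Config', 'white_url')
black_title = conf.get('Config', 'black_title')
white_title = conf.get('Config', 'white_title')
black_content = conf.get('Config', 'black_content')
white_content = conf.get('Config', 'white_content')
timeout = conf.getint('Config', 'timeout')     # passed to requests as the timeout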
User rule tree
The filtering function driven by the configuration file.
Method one: the parameter is a list of URLs
def RuleUrl(urls):
    UA = random.choice(headerss)
    header = {'User-Agent': UA, 'Connection': 'close'}
    # Step 1: URL blacklist - drop every URL that contains a blacklisted keyword
    black_url_or = []
    for url in urls:
        if black_url == 'None':
            black_url_or.append(url)
        elif '|' in black_url:
            black_url_key = black_url.split('|')
            if all(key not in url for key in black_url_key):
                black_url_or.append(url)
        else:
            if black_url not in url:
                black_url_or.append(url)
    # Step 2: URL whitelist - keep only URLs that contain a whitelisted keyword
    white_url_or = []
    for url in black_url_or:
        if white_url == 'None':
            white_url_or.append(url)
        elif '|' in white_url:
            white_url_key = white_url.split('|')
            if any(key in url for key in white_url_key):
                white_url_or.append(url)
        else:
            if white_url in url:
                white_url_or.append(url)
    # Step 3: title blacklist - drop every URL whose title contains a blacklisted keyword
    black_title_or = []
    for url in white_url_or:
        if black_title == 'None':
            black_title_or.append(url)
        elif '|' in black_title:
            black_title_key = black_title.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if all(key not in title[0] for key in black_title_key):
                        black_title_or.append(url)
            except:
                pass
        else:
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if black_title not in title[0]:
                        black_title_or.append(url)
            except:
                pass
    # Step 4: title whitelist - keep only URLs whose title contains a whitelisted keyword
    white_title_or = []
    for url in black_title_or:
        if white_title == 'None':
            white_title_or.append(url)
        elif '|' in white_title:
            white_title_key = white_title.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if any(key in title[0] for key in white_title_key):
                        white_title_or.append(url)
            except:
                pass
        else:
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if white_title in title[0]:
                        white_title_or.append(url)
            except:
                pass
    # Step 5: content blacklist - drop every URL whose page content contains a blacklisted keyword
    black_content_or = []
    for url in white_title_or:
        if black_content == 'None':
            black_content_or.append(url)
        elif '|' in black_content:
            black_content_key = black_content.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if all(key not in content for key in black_content_key):
                        black_content_or.append(url)
            except:
                pass
        else:
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if black_content not in content:
                        black_content_or.append(url)
            except:
                pass
    # Step 6: content whitelist - keep only URLs whose page content contains a whitelisted keyword
    white_content_or = []
    for url in black_content_or:
        if white_content == 'None':
            white_content_or.append(url)
        elif '|' in white_content:
            white_content_key = white_content.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if any(key in content for key in white_content_key):
                        white_content_or.append(url)
            except:
                pass
        else:
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if white_content in content:
                        white_content_or.append(url)
            except:
                pass
    return white_content_or
Method two: the parameter is a single URL
Although the first version implements the overall logic, it is far too slow: every URL may be fetched several times in sequence (once per title/content check) and nothing runs concurrently, so this version changes the overall approach.
The function now receives a single URL and returns the URL if it passes all the rules, or False if it does not, which makes it easy to run the checks concurrently.
def rule_url(url):
    # Step 1: URL blacklist - reject if the URL contains any blacklisted keyword
    if black_url != 'None' and (any(key in url for key in black_url.split('|'))):
        return False
    # Step 2: URL whitelist - reject if the URL contains none of the whitelisted keywords
    if white_url != 'None' and (all(key not in url for key in white_url.split('|'))):
        return False
    try:
        UA = random.choice(headerss)
        header = {'User-Agent': UA, 'Connection': 'close'}
        r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
        if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
            encoding = requests.utils.get_encodings_from_content(r.text)[0]
            content = r.content.decode(encoding)
            title = re.findall('<title>(.*?)</title>', content, re.S)[0]
            # Step 3: title blacklist - reject if the title contains any blacklisted keyword
            if black_title != 'None' and (any(key in title for key in black_title.split('|'))):
                return False
            # Step 4: title whitelist - reject if the title contains none of the whitelisted keywords
            if white_title != 'None' and (all(key not in title for key in white_title.split('|'))):
                return False
            # Step 5: content blacklist - reject if the page content contains any blacklisted keyword
            if black_content != 'None' and (any(key in content for key in black_content.split('|'))):
                return False
            # Step 6: content whitelist - reject if the page content contains none of the whitelisted keywords
            if white_content != 'None' and (all(key not in content for key in white_content.split('|'))):
                return False
            return url
        else:
            return False
    except:
        return False
Entry function
# Program entry
def Result():
    print(f'Current user: {whoami}')
    if state == '0':
        print(f'[-] Friend link crawling: disabled')
    elif state == '1':
        print(f'[+] Friend link crawling: enabled')
    else:
        print(f'[x] Friend link crawling: enter 0/1!')
    if black_url == 'None':
        print(f'[-] URL blacklist: disabled', end='')
    else:
        print(f'[+] URL blacklist: enabled', end='')
    if white_url == 'None':
        print(f' [-] URL whitelist: disabled')
    else:
        print(f' [+] URL whitelist: enabled')
    if black_title == 'None':
        print(f'[-] Title blacklist: disabled', end='')
    else:
        print(f'[+] Title blacklist: enabled', end='')
    if white_title == 'None':
        print(f' [-] Title whitelist: disabled')
    else:
        print(f' [+] Title whitelist: enabled')
    if black_content == 'None':
        print(f'[-] Content blacklist: disabled', end='')
    else:
        print(f'[+] Content blacklist: enabled', end='')
    if white_content == 'None':
        print(f' [-] Content whitelist: disabled')
    else:
        print(f' [+] Content whitelist: enabled')
    print('=' * 50)
    print('0: Keyword scanning 1: Import text scanning')
    try:
        num = int(input('Please select the startup mode (0/1):'))
        if num == 0:
            rurls = set()
            keywor = input('Please enter the keyword:')
            t1 = time.time()
            bd = bd_urls(keywor, num)
            rule_bd_urls = set()
            with ThreadPoolExecutor(max_workers=10) as executor:
                results_bd = executor.map(rule_url, bd)
                with lock:
                    for url in results_bd:
                        if url:
                            rule_bd_urls.add(url)
            # Crawl each qualifying URL again for friend links
            l_urls = set()
            rule_link_urls = set()
            for url in rule_bd_urls:
                link_urls = get_links(url)
                l_urls.update(link_urls)
            with ThreadPoolExecutor(max_workers=10) as executor:
                results_link = executor.map(rule_url, l_urls)
                with lock:
                    for url in results_link:
                        if url:
                            rule_link_urls.add(url)
            # Save the whole set at once to reduce IO
            with open(f'{keywor}_url.txt', 'a+', encoding='utf-8') as a:
                a.write('\n'.join(rule_link_urls))
            t2 = time.time()
            print(f'Scanning completed, time taken: {t2-t1}, results saved to [{keywor}_url.txt]')
        elif num == 1:
            print('Hint: The URLs in the text must include the protocol, e.g. http/https')
            urls = set([url.strip() for url in open(input('Drag the required url file to this window:'), 'r', encoding='utf-8')])
            print(f'Found {len(urls)} sites in the text, scanning in progress...')
            result = set()
            if state == '0':
                t1 = time.time()
                with ThreadPoolExecutor(max_workers=10) as executor:
                    results_link = executor.map(rule_url, urls)
                    with lock:
                        for url in results_link:
                            if url:
                                result.add(url)
            elif state == '1':
                t1 = time.time()
                l_url = set()
                for url in urls:
                    l = get_links(url)
                    l_url.update(l)
                with ThreadPoolExecutor(max_workers=10) as executor:
                    results_link = executor.map(rule_url, l_url)
                    with lock:
                        for url in results_link:
                            if url:
                                result.add(url)
            with open(filename, 'a+', encoding='utf-8') as a:
                a.write('\n'.join(result))
            t2 = time.time()
            t = str(t2 - t1).split('.')[0]
            print(f'Scanning completed, time taken: {t}s, results saved to [{filename}]')
        else:
            print('Input error, program terminated!')
    except Exception as e:
        print(f'Input error, program terminated, error type: {e}')


if __name__ == '__main__':
    title = '''
      _ ______ _    _ _____  _
     | |  ____| |  | |  __ \| |
     | | |__  | |  | | |__) | |
 _   | |  __| | |  | |  _  /| |
| |__| | |    | |__| | | \ \| |____
 \____/|_|     \____/|_|  \_\______|
    '''
    # print(title)
    Result()
Usage and testing
The script has been tested against more than 10,000 sites and currently runs normally.
Configuration file config.ini
None means the corresponding check is skipped; only or-logic is supported, via the | symbol
A field must not be left blank: write None to disable a check, otherwise the script will not run correctly
state only accepts 0/1: 0 disables friend-link crawling for imported text, 1 enables it
Keyword priority: URL blacklist > URL whitelist > title blacklist > title whitelist > page-content blacklist > page-content whitelist
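For example, to keep only sites whose titles look education-related (学院 / 大学 / 教育, i.e. college / university / education) while dropping any page that mentions 安全狗, the relevant lines could be written as follows; the keyword values are purely illustrative:

white_title = 学院|大学|教育
black_content = 安全狗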
Demonstration:
Note: After crawling, the results are saved in txt format in the current directory
1. Collect education sites through the search engine entry point
2. Keep only the education sites from an imported text file, without friend-link crawling
3. Keep only the education sites from an imported text file, then crawl their friend links as well
Conclusion
With the code above you can assemble the complete script. If you would rather not put it together yourself, you can download the packaged version; the script has been uploaded to GitHub: https://github.com/JiangFengSec/JF_URL. Anyone interested is welcome to try it, and if it helps you, please give it a star. Thank you!
