
Idle chatter

Recently, my company asked me to check whether our homepage had been injected with hidden links. After searching online, I found few scripts for the job, so I wrote one to extract a website's links. As the requirements grew and a few ideas came together, it eventually turned into a URL collector.

Preface

URL collection is an important task: it helps us quickly gather URLs that match a set of requirements. However, most URL collection software on the market works the same way: it calls multiple search engine interfaces, takes a keyword as input (for example, to collect recruitment websites you would search for terms like "job hunting" or "recruitment"), pulls as many URLs as possible from each interface, applies a user-defined URL blacklist, and finally deduplicates the results.
This means you need as many interfaces as possible (including but not limited to Google, Baidu, and so on), pass the query parameters, extract URLs from the returned pages after blacklist filtering, and iterate through the result pages.
That sounds reasonable: enter a keyword, get relevant URLs. But it hides several shortcomings:
1. The collected URLs are limited to what the search engines have indexed, so many URLs that actually meet the requirements are never collected.
2. The filtering is coarse: deduplication and a blacklist alone cannot guarantee that the collected sites are the ones you actually need.
3. Everyone uses the same collectors with more or less the same keywords, so the final results are also more or less the same. That is bad news for security personnel: a vulnerable site you find may well have already been exploited by many others.

Features


To address the shortcomings above, I decided to write a deep URL collection script. The preliminary feature list is as follows:
1. Two entry points: a search engine interface, or importing already collected website URLs from a text file
2. For keyword search results, crawl the sites that meet the requirements and then automatically crawl their friend links as well
3. For imported text, first filter out the sites that do not meet the requirements, then optionally crawl their friend links
4. Users can define URL whitelists and blacklists, page title whitelists and blacklists, and page content whitelists and blacklists
The brief process diagram is as follows:
(process diagram)
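
To make the flow concrete, here is a minimal sketch of the pipeline described above. The functions BDUrl, GetLink and rule_url are the ones defined later in this post; the wiring in collect() is an illustrative assumption, not the packaged tool itself.

from concurrent.futures import ThreadPoolExecutor

def collect(keyword, crawl_friend_links=True):
    # Step 1: entry point - gather seed sites from the search engine interface
    seeds = BDUrl(keyword)                        # defined under "Keyword collection"

    # Step 2: keep only the seeds that pass the user's rule tree
    with ThreadPoolExecutor(max_workers=10) as pool:
        kept = {u for u in pool.map(rule_url, seeds) if u}   # rule_url returns the URL or False

    # Step 3: optionally expand every kept site through its friend links and filter again
    if crawl_friend_links:
        links = set()
        for u in kept:
            links.update(GetLink(u))              # defined under "Friend link collection"
        with ThreadPoolExecutor(max_workers=10) as pool:
            kept |= {u for u in pool.map(rule_url, links) if u}

    return kept
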
Banner

title = '''
       __   _______   __    __  .______       __
      |  | |   ____| |  |  |  | |   _  \     |  |
      |  | |  |__    |  |  |  | |  |_)  |    |  |
.--.  |  | |   __|   |  |  |  | |      /     |  |
|  `--'  | |  |      |  `--'  | |  |\  \----.|  `----.
 \______/  |__|  _____\______/  | _| `._____||_______|
                |______|
                                        Author: JF
                                        Version: V1.0
        '''

URL collection source code

Friend link collection

Method one: Regular expression filtering

def GetLink(url):
    UA = random.choice(headerss)
    headers = {'User-Agent': UA, 'Connection': 'close'}
    link_urls = []
    try:
        r = requests.get(url, headers=headers, verify=False, timeout=timeout)
        encoding = requests.utils.get_encodings_from_content(r.text)[0]
        content = r.content.decode(encoding)
        urls = [f"{urlparse(url).scheme}://{urlparse(url).netloc}"  for url in re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', content, re.I)]
        for url in list(set(urls)):
            url = url.replace('\\', '')
            link_urls.append(url)
            # Liveness check (optional)
            # try:
            #     r = requests.get(url, timeout=5, verify=False)
            #     if b'Service Unavailable' not in r.content and b'The requested URL was not found on' not in r.content and b'The server encountered an internal error or miscon' not in r.content:
            #     if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
            #             link_urls.append(url)
            # except Exception as error:
            #     pass
    except:
        pass
    return list(set(link_urls))
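
Both versions of GetLink (and the functions below) rely on a few module-level names the snippets do not show: the imports, a headerss list of User-Agent strings, the timeout value, and a lock used later by the entry function. A minimal sketch of those assumed definitions, so the snippets can run on their own:

import random
import re
import time
import requests
import urllib3
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Suppress the warnings produced by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Pool of User-Agent strings picked at random for each request (extend as needed)
headerss = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15',
]

# Request timeout in seconds; in the real script this comes from config.ini
timeout = 5

# Lock used by the entry function when merging thread results
lock = Lock()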

Method two: bs4 filtering

def GetLink(url):
    UA = random.choice(headerss)
    headers = {'User-Agent': UA, 'Connection': 'close'}
    bs4_urls = set()
    try:
        r = requests.get(url, headers=headers, verify=False, timeout=timeout)
        encoding = requests.utils.get_encodings_from_content(r.text)[0]
        content = r.content.decode(encoding)

        # Use BeautifulSoup to parse HTML content
        soup = BeautifulSoup(content, 'html.parser')

        # Extract common tag attributes that may contain URLs on the page
        bs4_urls = set()
        for tag in ['a', 'img', 'script', 'link']:
            for attr in ['href', 'src']:
                for element in soup.find_all(tag):
                    if attr in element.attrs:
                        href = element.get(attr)
                        if href and (href.startswith('http://') or href.startswith('https://')):
                            parsed = urlparse(href)
                            url = f"{parsed.scheme}://{parsed.netloc}"
                            bs4_urls.add(url)
    except Exception as e:
        pass

    # Keep only the URLs that pass a liveness check
    link_urls = []
    for bs4_url in bs4_urls:
        try:
            r = requests.head(bs4_url, timeout=5, headers=headers, verify=False)
            if r.status_code in (200, 301, 302):
                link_urls.append(bs4_url)
        except Exception as error:
            pass
    return link_urls

Keyword collection

Search the user's keyword through the Baidu search interface and extract the site URLs from the first 8 result pages.

def BDUrl(key):
    cookie = input('Please enter the cookie:')
    bd_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Referer": "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=2&ch=&tn=baiduhome_pg&bar=&wd=123&oq=123&rsv_pq=896f886f000184f4&rsv_t=fdd2CqgBgjaepxfhicpCfrqeWVSXu9DOQY5WyyWqQYmsKOC%2Fl286S248elzxl%2BJhOKe2&rqlang=cn",
        # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
        "Sec-Fetch-Mode": "navigate",
        "Cookie": cookie,
        "Connection": "Keep-Alive",
    }
    bd_url = []
    for page in range(0, 8):
        url = 'http://www.baidu.com/s?wd={}&pn={}0'
        try:
            r = requests.get(url.format(key, page), headers=bd_headers, verify=False)
            encoding = requests.utils.get_encodings_from_content(r.text)[0]
            content = r.content.decode(encoding)
            result = [f"{urlparse(url).scheme}://{urlparse(url).netloc}" for url in re.findall('mu="(.*?)"', content)[1:]]
            #result = [url.split('//')[1].split('/')[0] for url in re.findall('mu="(.*?)"', content)[1:]]
            for res_url in list(set(result)):
                bd_url.append(res_url)
                # Liveness check (optional)
                # try:
                #     r = requests.get(res_url, timeout=5, verify=False)
                #     if b'Service Unavailable' not in r.content and b'The requested URL was not found on' not in r.content and b'The server encountered an internal error or miscon' not in r.content:
                #     if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                #             bd_url.append(res_url)
                # except Exception as error:
                #     pass
        except:
            pass
    return list(set(bd_url))

ini configuration file

The core of the script: the user filters for the desired URLs by editing a custom configuration file.

[User]
# Username of the program
whoami = JF

# state: friend link crawling, 0 = disabled, 1 = enabled
# Other notes:
# None means the corresponding check is skipped
# OR logic is supported via the | separator
# Priority: URL blacklist > URL whitelist > title blacklist > title whitelist > web content blacklist > web content whitelist

[Config]
# Friend link crawling
state = 0

# Black list of URLs
black_url = None

# White list of URLs
white_url = None

# Black list of titles
black_title = None

# White list of titles
white_title = 安全狗

# Black list of web page content
black_content = None

# White list of web page content
white_content = None

# Connection timeout 5 seconds
timeout = 5
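
The post does not show how these values are loaded into the globals used by the functions below (whoami, state, black_url, and so on). Here is a minimal sketch, assuming the file is named config.ini and sits next to the script, using only the standard library configparser:

import configparser

config = configparser.ConfigParser()
config.read('config.ini', encoding='utf-8')

# [User] section
whoami = config.get('User', 'whoami')

# [Config] section: every value is kept as a string, matching the
# string comparisons (e.g. black_url == 'None') used in the filtering code
state         = config.get('Config', 'state')
black_url     = config.get('Config', 'black_url')
white_url     = config.get('Config', 'white_url')
black_title   = config.get('Config', 'black_title')
white_title   = config.get('Config', 'white_title')
black_content = config.get('Config', 'black_content')
white_content = config.get('Config', 'white_content')
timeout       = config.getint('Config', 'timeout')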

User rule tree

The filtering function driven by the configuration file

Method one: the parameter is a list of URLs

def RuleUrl(urls):
    ruleurls = []
    UA = random.choice(headerss)
    header = {'User-Agent': UA, 'Connection': 'close'}
    # The first step, limit URL by blacklist, exclude all that are in the blacklist
    black_url_or = []
    for url in urls:
        if black_url == 'None':
            black_url_or.append(url)
        elif '|' in black_url:
            black_url_key = black_url.split('|')
            if all(key not in url for key in black_url_key):
                black_url_or.append(url)
        else:
            black_url_key = [black_url]  # wrap the single keyword in a list so the any()/all() checks test the whole keyword
            if any(key not in url for key in black_url_key):
                black_url_or.append(url)

    # The second step, limit URL by whitelist, only save those that appear in the whitelist
    white_url_or = []
    for url in black_url_or:
        if white_url == 'None':
            white_url_or.append(url)
        elif '|' in white_url:
            white_url_key = white_url.split('|')
            if any(key in url for key in white_url_key):
                white_url_or.append(url)
        else:
            white_url_key = [white_url]
            if all(key in url for key in white_url_key):
                white_url_or.append(url)

    # The third step, filter website title by blacklist, exclude all that are in the blacklist
    black_title_or = []
    for url in white_url_or:
        if black_title == 'None':
            black_title_or.append(url)
        elif '|' in black_title:
            black_title_key = black_title.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if all(key not in title[0] for key in black_title_key):
                        black_title_or.append(url)
            except:
                pass
        else:
            black_title_key = [black_title]
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if all(key not in title[0] for key in black_title_key):
                        black_title_or.append(url)
            except:
                pass

    # The fourth step, filter website title by whitelist, only save those that appear in the whitelist
    white_title_or = []
    for url in black_title_or:
        if white_title == 'None':
            white_title_or.append(url)
        elif '|' in white_title:
            white_title_key = white_title.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if any(key in title[0] for key in white_title_key):
                        white_title_or.append(url)
            except:
                pass
        else:
            white_title_key = [white_title]
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    title = re.findall('<title>(.*?)</title>', content, re.S)
                    if all(key in title[0] for key in white_title_key):
                        white_title_or.append(url)
            except:
                pass

    # The fifth step, filter web content by blacklist, exclude all that appear in the blacklist
    black_content_or = []
    for url in white_title_or:
        if black_content == 'None':
            black_content_or.append(url)
        elif '|' in black_content:
            black_content_key = black_content.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if all(key not in content for key in black_content_key):
                        black_content_or.append(url)
            except:
                pass
        else:
            black_content_key = [black_content]
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if any(key not in content for key in black_content_key):
                        black_content_or.append(url)
            except:
                pass
    # Step six, filter website content whitelist, only save the urls that appear in the whitelist
    white_content_or = []
    for url in black_content_or:
        if white_content == 'None':
            white_content_or.append(url)
        elif '|' in white_content:
            white_content_key = white_content.split('|')
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if any(key in content for key in white_content_key):
                        white_content_or.append(url)
            except:
                pass
        else:
            white_content_key = [white_content]
            try:
                r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
                encoding = requests.utils.get_encodings_from_content(r.text)[0]
                content = r.content.decode(encoding)
                if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
                    if all(key in content for key in white_content_key):
                        white_content_or.append(url)
            except:
                pass

    return white_content_or

Method two: the parameter is a single URL

Although the first version implements the overall logic, it is far too slow, so this version restructures it:
the function receives a single URL and returns the URL if it passes all checks (False otherwise), which makes it easy to run the checks concurrently.

def rule_url(url):
    # The first step, limit URL by blacklist, exclude all that are in the blacklist
    if black_url != 'None' and (any(key in url for key in black_url.split('|'))):
        return False

    # The second step, limit URL by whitelist, only save those that appear in the whitelist
    if white_url != 'None' and (all(key not in url for key in white_url.split('|'))):
        return False

    try:
        UA = random.choice(headerss)
        header = {'User-Agent': UA, 'Connection': 'close'}
        r = requests.get(url=url, headers=header, verify=False, timeout=timeout)
        if r.status_code == 200 or r.status_code == 301 or r.status_code == 302:
            encoding = requests.utils.get_encodings_from_content(r.text)[0]
            content = r.content.decode(encoding)
            title = re.findall('<title>(.*?)</title>', content, re.S)[0]
            # The third step, filter website title by blacklist, exclude all that are in the blacklist
            if black_title != 'None' and (any(key in title for key in black_title.split('|'))):
                return False
            # The fourth step, filter website title by whitelist, only save those that appear in the whitelist
            if white_title != 'None' and (all(key not in title for key in white_title.split('|'))):
                return False

            # The fifth step, filter web content by blacklist, exclude all that appear in the blacklist
            if black_content != 'None' and (any(key in content for key in black_content.split('|'))):
                return False
            # Step six, filter website content whitelist, only save the urls that appear in the whitelist
            if white_content != 'None' and (all(key not in content for key in white_content.split('|'))):
                return False

            return url
        else:
            return False
    except:
        return False
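
As a usage sketch of the concurrency mentioned above (the thread-pool pattern is the same one the entry function uses; the urls variable here is just an assumed example input):

from concurrent.futures import ThreadPoolExecutor

# Assumed example input; in the real script these come from Baidu results or an imported text file
urls = ['https://example.com', 'https://example.org']

# rule_url returns the URL itself when it passes every check, and False otherwise,
# so filtering the map() results keeps only the URLs that satisfy the rule tree
with ThreadPoolExecutor(max_workers=10) as executor:
    passed = {u for u in executor.map(rule_url, urls) if u}

print(passed)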

Entry function

# Program entry
def Result():
    print(f'Current user: {whoami}')
    if state == '0':
        print(f'[-] Friend link crawling: disabled')
    elif state == '1':
        print(f'[+] Friend link crawling: enabled')
    else:
        print(f'[x] Friend link crawling: Enter 0/1!')
    if black_url == 'None':
        print(f'[-] URL blacklist: disabled', end='')
    else:
        print(f'[+] URL blacklist: enabled', end='')
    if white_url == 'None':
        print(f'  [-] URL whitelist: disabled')
    else:
        print(f'  [+] URL whitelist: enabled')

    if black_title == 'None':
        print(f'[-] Title blacklist: disabled', end='')
    else:
        print(f'[+] Title blacklist: enabled', end='')
    if white_title == 'None':
        print(f'  [-] Title whitelist: disabled')
    else:
        print(f'  [+] Title whitelist: enabled')

    if black_content == 'None':
        print(f'[-] Website blacklist: disabled', end='')
    else:
        print(f'[+] Website blacklist: enabled', end='')
    if white_content == 'None':
        print(f'  [-] Website whitelist: disabled')
    else:
        print(f'  [+] Website whitelist: enabled')
    print('='*50)
    print('0: Keyword scanning       1: Import text scanning')
    try:
        num = int(input('Please select the startup mode (0/1):'))
        if num == 0:
            rurls = set()
            keywor = input('Please enter the keyword:')
            t1 = time.time()

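            # Assumption: bd_urls and get_links are the snake_case counterparts of BDUrl and GetLink shown above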
            bd = bd_urls(keywor, num)
            rule_bd_urls = set()
            with ThreadPoolExecutor(max_workers=10) as executor:
                results_bd = executor.map(rule_url, bd)
            with lock:
                for url in results_bd:
                    if url:
                        rule_bd_urls.add(url)

            # Check each URL for friendship links
            l_urls = set()
            rule_link_urls = set()
            for url in rule_bd_urls:
                link_urls = get_links(url)
                l_urls.update(link_urls)
            with ThreadPoolExecutor(max_workers=10) as executor:
                results_link = executor.map(rule_url, l_urls)
            with lock:
                for url in results_link:
                    if url:
                        rule_link_urls.add(url)
            # Save the whole set in one write to reduce IO
            with open(f'{keywor}_url.txt', 'a+', encoding='utf-8') as a:
                a.write('\n'.join(rule_link_urls))
            t2 = time.time()
            print(f'Scanning completed, time taken: {t2-t1}, results saved to [{keywor}_url.txt]')
        elif num == 1:
            print('Hint: The URLs in the text must include the protocol type, such as: http/https')
            urls = set([url.strip() for url in open(input('Drag the required url to this window:'),'r',encoding='utf-8')])
            print(f'Total {len(urls)} sites found in text, scanning in progress...')
            result = set()
            if state == '0':
                t1 = time.time()
                with ThreadPoolExecutor(max_workers=10) as executor:
                    results_link = executor.map(rule_url, urls)
                with lock:
                    for url in results_link:
                        if url:
                            result.add(url)
            elif state == '1':
                t1 = time.time()
                l_url = set()
                for url in urls:
                    l = get_links(url)
                    l_url.update(l)

                with ThreadPoolExecutor(max_workers=10) as executor:
                    results_link = executor.map(rule_url, l_url)
                with lock:
                    for url in results_link:
                        if url:
                            result.add(url)
            with open(filename, 'a+', encoding='utf-8') as a:
                a.write('\n'.join(result))
            t2 = time.time()
            t = str(t2-t1).split('.')[0]
            print(f'Scanning completed, time taken: {t}s, results saved to [{filename}]')

        else:
            print('Input error, program terminated!')

    except Exception as e:
        print(f'Input error, program terminated, error type: {e}')

if __name__ == '__main__':
    title = '''
           _ ______   _    _ _____  _
          | |  ____| | |  | |  __ \| |
          | | |__    | |  | | |__) | |
      _   | |  __|   | |  | |  _  /| |
     | |__| | |      | |__| | | \ \| |____ 
      \____/|_|       \____/|_|  \_\______|
    '''
    #print(title)
    Result()

Testing

The script has been tested against more than 10,000 sites and currently runs normally.

Configuration file config.ini

  • None means the keyword check is skipped; only OR logic is supported, written with the | symbol (see the example after this list)

  • A skipped check must be written as None; the field cannot be left blank, otherwise the script will not run

  • state only accepts 0/1: 0 disables friend link crawling for imported text, 1 enables it

  • Keyword priority: URL blacklist > URL whitelist > title blacklist > title whitelist > web content blacklist > web content whitelist
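
For example, a configuration that keeps only education-related sites and skips the remaining checks might look like this (the keyword values are illustrative assumptions, not part of the original config):

[Config]
# Crawl friend links of the imported text
state = 1

# Drop any URL containing gov or mil
black_url = gov|mil

# Keep only URLs containing edu
white_url = edu

# Skip the remaining checks
black_title = None
white_title = None
black_content = None
white_content = None

timeout = 5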

Demonstration:
Note: After crawling, the results are saved in txt format in the current directory

1. Crawl educational sites through the search engine entry

(screenshot)

2. First select the educational sites from the imported text, without friend link crawling

(screenshot)

3. First select the educational sites from the imported text, then crawl their friend links

(screenshot)

Conclusion

With the code above you can put together the complete script. If you would rather not bother, you can download the packaged version; the script has been uploaded to GitHub: https://github.com/JiangFengSec/JF_URL. Anyone interested is welcome to download and try it, and if it helps you, please give it a star. Thank you!
