8 basic methods of automating the collection of information from company websites

Today we're going to talk a little more about the problem of finding URLs on a particular domain. In last week's article I described in detail how to use the Python Site Map Generator tool (https://medium.com/osint-ambition/how-to-use-python-sitemap-generator-for-osint-77bc69fa165d), and today I'll cover some of the nuances of using Katana.

Katana (https://github.com/projectdiscovery/katana) is a simple, fast open source crawling tool written in Go. It was created by ProjectDiscovery, the same company behind the Nuclei vulnerability scanner, which you can read about in the article: Using Nuclei for OSINT. 5-minute basic guide.

Basic usage

Let's install:

go install github.com/projectdiscovery/katana/cmd/katana@latest

And run help:

katana -h

If the install command fails, you probably don't have Go installed on your system:

Go installation instructions.
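
A quick way to tell which piece is missing is to check whether both binaries are on your PATH. This is only a sketch for a POSIX-like shell, and check_tool is a hypothetical helper name used for illustration:

```shell
# Report whether a command is available on PATH (hypothetical helper).
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found"
    return 1
  fi
}

check_tool go || echo "Install Go first."
# "go install" places binaries in GOBIN or, by default, in GOPATH/bin.
check_tool katana || echo "If Go is installed, check that \$(go env GOPATH)/bin is on your PATH."
```

If Go is found but katana is not, the install usually succeeded and the binary directory is simply missing from PATH.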

Let's run the most basic variant of crawling to get a list of URLs for a domain:

katana -u sector035.nl

If you looked through the help output, you'll have noticed that Katana has a huge number of settings and options. But for an OSINT specialist (as opposed to a pentester or bug hunter), only a few of them are really useful. I'll cover them in more detail below.

1. Crawl URLs for a list of domains

To collect URLs for multiple domains at the same time, save the list to a text file named domains.txt (one domain per line) and run Katana with the -list flag:

katana -list domains.txt
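
If your list comes from a spreadsheet or from copied links, it may contain schemes, paths, and duplicates. Here is a minimal sketch of cleaning it up before passing it to -list. The file raw.txt and its contents are made-up sample input, and the cleanup is only meant to avoid crawling the same domain twice:

```shell
# Made-up sample input: schemes, paths, mixed case, a blank line, a duplicate.
printf '%s\n' 'https://sector035.nl/' 'osintframework.com' '' \
  'HTTPS://OSINTFRAMEWORK.COM/about' > raw.txt

# Lowercase, strip scheme and path, drop empty lines, de-duplicate.
tr '[:upper:]' '[:lower:]' < raw.txt \
  | sed -E 's#^[a-z]+://##; s#/.*$##' \
  | grep -v '^$' \
  | sort -u > domains.txt

cat domains.txt
```

The resulting domains.txt can then be used with the katana -list command shown above.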

2. Filter results

First, you can filter links by file extension. Use the -em flag to include only results with a specific file extension:

katana -u sector035.nl -em js

Conversely, if you want to exclude one or more file types from the results, use the -ef flag:

katana -u sector035.nl -ef js,css

For example, this lets you filter out images, MS Office documents, PDFs, etc.

You can also filter the results using regular expressions. The simplest example is filtering links that contain certain keywords.

Use the -mr flag to include only regex matches:

katana -u osintframework.com -mr '.*d3.*'

And the -fr flag to exclude regex matches (in both cases, quote the pattern so the shell doesn't try to expand the asterisks):

katana -u osintframework.com -fr '.*d3.*'
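
Before spending time on a crawl, it can help to try a pattern on URLs you already have. The sketch below uses grep -E on a made-up URL list to approximate what -mr and -fr would keep. Katana is written in Go and presumably uses Go's regular expression syntax, which differs from grep's in some details, so treat this as a rough preview only:

```shell
# Made-up URL list standing in for saved Katana output.
printf '%s\n' \
  'https://example.com/js/d3.min.js' \
  'https://example.com/css/style.css' \
  'https://example.com/d3/tree.html' \
  'https://example.com/about' > sample_urls.txt

# Roughly what -mr '.*d3.*' would keep: lines that match the pattern.
grep -E '.*d3.*' sample_urls.txt > matched.txt

# Roughly what -fr '.*d3.*' would keep: lines that do not match.
grep -Ev '.*d3.*' sample_urls.txt > filtered.txt

echo "matched:";  cat matched.txt
echo "filtered:"; cat filtered.txt
```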

3. Save results to JSON

For easier further processing, it's better to save the results to a file using ">" rather than print them to the terminal. Use the -j flag to output the results in JSON format (one JSON object per line):

katana -u osintframework.com -j >result.json

In addition to the link to the page, the file will also store the request body, request time, server information and other parameters.
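
To get back to a plain list of unique URLs from such a file, you can pull a single field out of every line. The sketch below runs on made-up sample lines. The field layout (a request object with an endpoint field) is an assumption about what recent Katana versions write, so check a line of your own result.json first:

```shell
# Made-up sample lines; the field names are an assumption, not confirmed output.
cat > sample.jsonl <<'EOF'
{"timestamp":"2024-01-01T00:00:00Z","request":{"method":"GET","endpoint":"https://example.com/"},"response":{"status_code":200}}
{"timestamp":"2024-01-01T00:00:01Z","request":{"method":"GET","endpoint":"https://example.com/about"},"response":{"status_code":200}}
{"timestamp":"2024-01-01T00:00:02Z","request":{"method":"GET","endpoint":"https://example.com/"},"response":{"status_code":200}}
EOF

# Pull out every "endpoint" value and de-duplicate.
# This is a rough text match: it does not decode JSON escape sequences.
grep -o '"endpoint":"[^"]*"' sample.jsonl | cut -d '"' -f 4 | sort -u > unique_urls.txt

cat unique_urls.txt
```

If jq is installed, jq -r '.request.endpoint' result.json is a more robust way to do the same thing, under the same assumption about field names.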

4. JavaScript endpoints parsing

If you want to collect more information about the site (for example, to better understand how it is structured), enable the collection of endpoints from JavaScript files with the -jc flag:

katana -u lidl.com -jc

For example, this may help you find unknown directories, which you can then search for files using GoBuster (https://medium.com/the-first-digit/how-to-use-gobuster-for-osint-905bc9360024) and wordlists.
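
One way to turn crawl results into a list of candidate directories is to strip the file name from each URL. This is a minimal sketch on made-up URLs, where crawl.txt stands in for Katana output saved to a file:

```shell
# Made-up URLs standing in for saved Katana output.
printf '%s\n' \
  'https://example.com/assets/js/app.js' \
  'https://example.com/assets/img/logo.png' \
  'https://example.com/api/v1/users?id=1' \
  'https://example.com' > crawl.txt

# 1) drop query strings and fragments
# 2) give bare hosts a trailing slash
# 3) cut everything after the last slash, then de-duplicate
sed -E 's/[?#].*$//; s#^(https?://[^/]+)$#\1/#; s#/[^/]*$#/#' crawl.txt \
  | sort -u > dirs.txt

cat dirs.txt
```

Each line of dirs.txt is then a directory URL you could use as a starting point for a directory brute-forcing tool.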

5. Other useful flags

To wrap up, here are a few more flags that you may find useful:

  • -kf all (crawl known files such as robots.txt and sitemap.xml; accepted values are all, robotstxt and sitemapxml. These files describe the structure of the site, so this can speed up the work and possibly increase the number of results)
  • -d 5 (maximum depth to crawl, default 3. Increase the value if you want more detailed results and decrease it if you want faster work at the expense of detail)
  • -proxy (if your IP gets blocked for sending too many requests, you can work around it by routing requests through a proxy server)

That's the end of another article on the basics of OSINT automation on my blog. If you are interested in this topic, I recommend reading my short, free books about it:

Python for OSINT. 21-day course for beginners

Linux for OSINT. 21-day course for beginners