There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org saw it, there's a good chance Google did, too.
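If you'd rather skip the browser plugin entirely, the Wayback Machine's CDX API can return the same URL list programmatically. Here's a minimal Python sketch; the domain is a placeholder, and you may want to tweak the limit, collapse, or filter parameters for your own site.

```python
import requests

# Query the Wayback Machine CDX API for URLs captured under a domain.
# "example.com" is a placeholder; swap in the domain you're auditing.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "example.com/*",   # prefix match: everything under the domain
    "output": "json",         # JSON rows instead of plain text
    "fl": "original",         # only return the original URL field
    "collapse": "urlkey",     # collapse repeated captures of the same URL
    "limit": 10000,           # roughly the same ceiling as the web UI
}

response = requests.get(CDX_ENDPOINT, params=params, timeout=60)
response.raise_for_status()
rows = response.json()

# The first row is a header (["original"]); the rest are captured URLs.
urls = [row[0] for row in rows[1:]]

with open("archive_org_urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Saved {len(urls)} URLs from Archive.org")
```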
Moz Pro
While you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
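If you want to script that export rather than download CSVs, here's a rough sketch against the Moz Links API. Treat the endpoint, parameter names, and response fields as assumptions to verify against the current Moz API documentation; the credentials and target domain are placeholders.

```python
import requests

# Assumed endpoint and parameters for the Moz Links API v2 -- confirm against Moz's docs.
MOZ_LINKS_ENDPOINT = "https://lso.api.moz.com/v2/links"
ACCESS_ID = "your-access-id"        # placeholder credentials
SECRET_KEY = "your-secret-key"

payload = {
    "target": "example.com",        # placeholder: the site you're auditing
    "target_scope": "root_domain",  # assumed value: pull links to the whole domain
    "limit": 50,                    # page size; loop with next_token for more
}

target_urls = set()
while True:
    resp = requests.post(
        MOZ_LINKS_ENDPOINT,
        json=payload,
        auth=(ACCESS_ID, SECRET_KEY),
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()

    # Each result describes one link; collect the page on *your* site it points to
    # (field names here are assumptions).
    for result in data.get("results", []):
        target_urls.add(result.get("target"))

    next_token = data.get("next_token")
    if not next_token:
        break
    payload["next_token"] = next_token

print(f"Collected {len(target_urls)} target URLs from Moz")
```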
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't carry over to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
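For reference, here's a minimal Python sketch of pulling page-level data through the Search Console API with the google-api-python-client library. It assumes a service-account key that has been granted access to your property; the key file name, property URL, and date range are placeholders.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder key file; the service account must be added as a user on the property.
SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
service = build("searchconsole", "v1", credentials=creds)

site_url = "https://example.com/"   # placeholder: your verified property
all_pages = set()
start_row = 0

while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-12-31",
        "dimensions": ["page"],     # group by URL rather than query
        "rowLimit": 25000,          # API maximum per request
        "startRow": start_row,      # paginate past the UI export limits
    }
    response = service.searchanalytics().query(siteUrl=site_url, body=body).execute()
    rows = response.get("rows", [])
    if not rows:
        break

    for row in rows:
        all_pages.add(row["keys"][0])   # the page URL

    start_row += len(rows)

print(f"Found {len(all_pages)} pages with search impressions")
```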
Indexing → Pages report:
This section provides exports filtered by issue type, although these are also limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
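The same filtered export can also be pulled programmatically. Below is a hedged sketch using the GA4 Data API (the google-analytics-data Python client); the property ID, date range, and /blog/ filter are placeholders, and it assumes your environment already has credentials with access to the property.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# Assumes Application Default Credentials that can read the GA4 property.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",          # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    # Narrow the report to blog URLs, mirroring the /blog/ segment above.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,                              # matches the UI report's generous cap
)

response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"Exported {len(blog_paths)} blog page paths")
```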
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be massive, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process; a minimal parsing sketch follows below.
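If you'd rather not reach for a dedicated log analyzer, extracting the URL paths takes only a few lines of Python. This sketch assumes logs in the common/combined Apache-style format and a placeholder file name; adjust the regex for your server or CDN.

```python
import re
from urllib.parse import urlparse

# Matches the request portion of a common/combined log line: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log") as log:          # placeholder filename
    for line in log:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        # Strip query strings so /page?utm=x and /page collapse to one path.
        paths.add(urlparse(match.group(1)).path)

with open("log_paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))

print(f"Extracted {len(paths)} unique paths from the log file")
```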
Combine, and good luck
Once you've collected URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
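For a larger site, a small pandas script (the kind you'd run in a Jupyter Notebook) can handle the normalization and deduplication. The file names below are placeholders for whichever exports you've collected, one URL per line.

```python
import pandas as pd
from urllib.parse import urlsplit, urlunsplit

# Placeholder export files from the sources above, one URL per line each.
SOURCES = ["urls_archive.txt", "urls_moz.txt", "urls_gsc.txt", "urls_ga4.txt"]

def normalize(url: str) -> str:
    """Lowercase scheme/host, drop fragments, and strip trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

frames = []
for source in SOURCES:
    df = pd.read_csv(source, header=None, names=["url"])
    df["source"] = source            # keep track of where each URL came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined["url"] = combined["url"].map(normalize)
deduped = combined.drop_duplicates(subset="url")

deduped.to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs across {len(SOURCES)} sources")
```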
And voilà, you now have a comprehensive list of current, old, and archived URLs. Good luck!