Exclude status codes in Screaming Frog
-
I have a very large ecommerce site I'm trying to spider using Screaming Frog. The problem is that the crawl keeps hanging, even though I have turned off the high memory safeguard under Configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the crawl by depth is a possibility, but it would take quite a bit of manual labor, as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
- There are thousands of redirects. I haven't been able to exclude those from the spider because they don't have a distinguishing character string in their URLs.
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. It was simply that the exclude rule had to be set before the crawl began and could not be changed during the crawl. Hopefully this changes, because during a crawl you sometimes find things you want to exclude that you didn't know existed beforehand.
-
Are you sure it's just on Mac? Have you tried on PC? Do you have any other rules in Include, or perhaps a conflicting rule in Exclude? Try running a single exclude rule, and also try it on another small site to test.
Also, from support, if it's failing on all fronts:
- Mac version: please make sure you have the most up-to-date version of the OS, which will update Java.
- Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
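One way to sanity-check a single exclude rule before starting a crawl is to try the pattern against a handful of sample URLs. Below is a minimal sketch in Python. It assumes the exclude rule is a regular expression that has to match the whole URL, which is my understanding of how the exclude field behaved at the time, so check the user guide for your version. The `is_excluded` helper, the pattern, and the example.com URLs are all made up for illustration, and Python's regex flavor is not identical to Java's.

```python
import re

def is_excluded(url, exclude_patterns):
    """Return True if any pattern matches the entire URL (assumed behavior)."""
    return any(re.fullmatch(pattern, url) for pattern in exclude_patterns)

# Hypothetical rule: skip everything under /clearance/ on a made-up site.
exclude_patterns = [r"http://www\.example\.com/clearance/.*"]

# Made-up URLs to try the rule against.
sample_urls = [
    "http://www.example.com/clearance/lamp-123",
    "http://www.example.com/lamps/lamp-123",
    "http://www.example.com/clearance/",
]

# Map each URL to whether the rule would exclude it.
results = {url: is_excluded(url, exclude_patterns) for url in sample_urls}
for url, excluded in results.items():
    print("EXCLUDE" if excluded else "CRAWL  ", url)
```

If a URL you expected to be excluded shows up as crawlable here, the pattern itself is probably the issue rather than the Mac version.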
-
Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.
-
That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job's still running, it now appears as though there's almost a 1-to-1 correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code You can right click and remove after though!"
Thanks again, Michael. Your thoroughness and follow-through are appreciated.
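Since status codes are only known after a URL has been crawled, the filtering has to happen afterwards, either by right-clicking and removing as Dan describes or on an exported report. Here is a rough sketch of the second option in Python. It assumes the export is a CSV with `Address` and `Status Code` columns; adjust the column names to whatever your export actually contains. The `drop_status_codes` helper and the sample rows are invented for illustration.

```python
import csv
import io

def drop_status_codes(rows, excluded_prefixes=("3",), status_column="Status Code"):
    """Keep only rows whose status code does not start with an excluded prefix."""
    kept = []
    for row in rows:
        status = (row.get(status_column) or "").strip()
        if not status.startswith(tuple(excluded_prefixes)):
            kept.append(row)
    return kept

# Made-up stand-in for an exported crawl report; column names are an assumption.
sample_export = io.StringIO(
    "Address,Status Code\n"
    "http://www.example.com/,200\n"
    "http://www.example.com/old-page,301\n"
    "http://www.example.com/moved,302\n"
    "http://www.example.com/lamps,200\n"
)

rows = list(csv.DictReader(sample_export))
kept_rows = drop_status_codes(rows)  # drops every 3xx row by default
for row in kept_rows:
    print(row["Address"], row["Status Code"])
```

For a real export you would open the file with `open(path, newline="")` in place of the `io.StringIO` stand-in. This only cleans up the inventory; it does not reduce crawl time or memory use.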
-
I took another look, and also checked the documentation and online, and I don't see any way to exclude URLs from a crawl based on response codes. As I see it, you would only want to exclude by name or directory, since response codes are likely to be scattered randomly throughout a site and excluding on them would impede a thorough crawl.
-
Thank you Michael.
You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it and the scan has been running for more than 24 hours now without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would make sure you are running the right version of Java if you are on a 64-bit machine. The 32-bit version works, but you cannot increase the memory allocation, which is why you could be running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
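A quick way to see which Java you have is to look at the banner printed by `java -version`. The sketch below does that from Python. It relies on the assumption that 64-bit HotSpot and OpenJDK builds include the text "64-Bit" in that banner, which is what I have seen but is not guaranteed for every vendor. The function names are made up for this example.

```python
import subprocess

def is_64bit_java(version_output):
    """Guess the JVM bitness from the version banner (assumes '64-Bit' marker)."""
    return "64-Bit" in version_output

def read_java_version():
    """Return the text printed by `java -version`, or None if java is not found."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    # The banner normally goes to stderr; fall back to stdout just in case.
    return result.stderr or result.stdout

if __name__ == "__main__":
    banner = read_java_version()
    if banner is None:
        print("java was not found on the PATH")
    elif is_64bit_java(banner):
        print("Looks like a 64-bit JVM")
    else:
        print("No 64-Bit marker found; this may be a 32-bit JVM")
```

If it reports no 64-Bit marker on a 64-bit machine, installing the 64-bit Java is the first thing to try before raising the memory allocation.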