Exclude status codes in Screaming Frog

DonnaDuncan

I have a very large ecommerce site I'm trying to spider using Screaming Frog. The problem is that the crawl keeps hanging, even though I have turned off the high memory safeguard under configuration.

The site has approximately 190,000 pages according to the results of a Google site: command.

The site architecture is almost completely flat. Limiting the search by depth is a possibility, but it will take quite a bit of manual labor as there are literally hundreds of directories one level below the root.
There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
There are thousands of redirects. I haven't been able to exclude those from the spider because they don't have a distinguishing character string in their URLs.

Does anyone know how to exclude files using status codes? I know that would help.

If it helps, the site is kodylighting.com.

Thanks in advance for any guidance you can provide.

CHAD215

Thanks for your help. It was literally just the fact that the exclude had to be set up before the crawl began and could not be changed during the crawl. Hopefully this changes, because during a crawl you sometimes find things you want to exclude that you did not know existed beforehand.
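
Since the exclude rules have to be in place before the crawl starts, one option is to try the patterns against a few known URLs first. Below is a minimal sketch in Python, assuming the exclude list takes regular expressions that must match the whole URL. The patterns and URLs are hypothetical, and Python's regex engine may not behave identically to the one the spider uses, so treat it as a rough check only.

```python
import re

def excluded(url, patterns):
    """Return True if any exclude pattern matches the whole URL."""
    return any(re.fullmatch(p, url) for p in patterns)

# Hypothetical exclude patterns and URLs, for illustration only.
patterns = [
    r"http://www\.example\.com/archive/.*",  # everything under one directory
    r".*\?sort=.*",                          # any URL with a sort parameter
]
urls = [
    "http://www.example.com/archive/2012/lamps.html",
    "http://www.example.com/lamps.html?sort=price",
    "http://www.example.com/lamps.html",
]

for url in urls:
    label = "EXCLUDE" if excluded(url, patterns) else "crawl  "
    print(label, url)
```

If a URL you expected to be skipped shows up as "crawl", the pattern probably needs escaping or a trailing `.*`.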

MickEdwards

Are you sure it's just on Mac? Have you tried on PC? Do you have any other rules in include, or perhaps a conflicting rule in exclude? Try running a single exclude rule, and also try it on another small site to test.

Also from support if failing on all fronts:

Mac version: please make sure you have the most up-to-date version of the OS, which will update Java.
Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.

To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs

CHAD215

Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.

DonnaDuncan

That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job is still running, it now appears as though there's almost a one-to-one correlation between pages and redirects on the site.

I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code You can right click and remove after though!"
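
For anyone who would rather clean up an exported crawl than remove rows by hand, here is a rough sketch of dropping redirects after the fact. It assumes the crawl was exported to CSV with columns named "Address" and "Status Code"; those column names and the sample rows are assumptions for illustration, so adjust them to match the actual export.

```python
import csv
import io

def drop_redirects(rows, status_column="Status Code"):
    """Keep only rows whose status code is not in the 3xx range."""
    kept = []
    for row in rows:
        code = (row.get(status_column) or "").strip()
        if not code.startswith("3"):
            kept.append(row)
    return kept

# Hypothetical export contents; a real file would be opened with open(...).
sample = io.StringIO(
    "Address,Status Code\n"
    "http://www.example.com/,200\n"
    "http://www.example.com/old-page,301\n"
    "http://www.example.com/missing,404\n"
)

rows = list(csv.DictReader(sample))
kept = drop_redirects(rows)
for row in kept:
    print(row["Address"], row["Status Code"])
```

This does not save any crawl time, since every URL still has to be fetched to learn its status code. It only trims the inventory afterwards.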

Thanks again, Michael. Your thoroughness and follow-through are appreciated.

MickEdwards

I took another look, and also checked the documentation and online, and I don't see any way to exclude URLs from a crawl based on response codes. As I see it, you would only want to exclude by name or directory, since response codes are likely to be scattered randomly throughout a site and excluding on them would impede a thorough crawl.

DonnaDuncan

Thank you Michael.

You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it and the scan has been running for more than 24 hours now without hanging. So thank you.

If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.

MickEdwards

I don't think you can filter out on response codes.

However, first I would make sure you are running the right version of Java if you are on a 64-bit machine. The 32-bit version works, but you cannot increase the memory allocation, which is why you could be running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
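
A quick way to see which Java is installed is to run `java -version` and look at the last line. The sketch below does the same check in Python. It assumes a HotSpot-style JVM, which prints "64-Bit" in its version banner on 64-bit builds; other JVMs may word this differently. The sample text is illustrative, not taken from this thread.

```python
import subprocess

def is_64bit_jvm(version_output):
    """Return True if the version banner mentions a 64-bit VM (assumes HotSpot wording)."""
    return "64-Bit" in version_output

def java_version_output():
    """Run `java -version` and return its text (the banner is written to stderr)."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr + result.stdout

# Illustrative banner text from a 64-bit install.
sample = (
    'java version "1.7.0_25"\n'
    "Java(TM) SE Runtime Environment (build 1.7.0_25-b15)\n"
    "Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)\n"
)
print(is_64bit_jvm(sample))

# On a machine with Java installed:
#   print(is_64bit_jvm(java_version_output()))
```

If the check comes back False on a 64-bit machine, installing the 64-bit Java is the first thing to try before raising the memory allocation.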
