XML-sitemap generator and sh404sef security-settings (Troubleshooting)

2013-03-20

I was not happy with any of the sitemap generators for Joomla! that I had tried. In a chat, Jan Giselberg (joomla-downloads.de) told me about an external sitemap generator (http://www.xml-sitemaps.com) that is not integrated into Joomla!, which was reason enough for me to give it a try.

The basic configuration was easy, and a little later I generated my first sitemap.xml. Sadly, not all of my links were included. The site is not huge, but it has a little more content than that. That meant RTFM (read the fucking manual), so I searched the forums but did not get a real hint about what had happened. While watching another generation run, I discovered that the pages were found but still did not make it into the sitemap.
So I took a look at the change log (one of the tabs), which showed me that 69 pages were not included.


Equipped with this new knowledge, I consulted the forum again. This time I found something useful (which I could/should have discovered on my own ...): there is also a detailed version of the log. This gave me the vital clue. All 69 of those pages were marked as 403 FORBIDDEN.

joomla/jug-hamburg.html - 403 FORBIDDEN
joomla/entwicklung.html - 403 FORBIDDEN
blog.html - 403 FORBIDDEN
blog/meldungen-dewesode.html - 403 FORBIDDEN
kontakt.html - 403 FORBIDDEN
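
With 69 entries, picking the blocked pages out of the detailed log by eye gets tedious. Here is a minimal sketch that filters them, assuming every log entry has the form "<path> - <status>" as in the excerpt above. The function name and the sample text are only illustrative and are not part of the sitemap generator.

```python
def forbidden_paths(log_text):
    """Return the paths whose log entry reports a 403 status."""
    paths = []
    for line in log_text.splitlines():
        # Assumption: each entry looks like '<path> - <status>'.
        path, sep, status = line.strip().rpartition(" - ")
        if sep and status.upper().startswith("403"):
            paths.append(path)
    return paths


# Sample taken from the log excerpt above, plus one made-up OK line.
sample = """joomla/jug-hamburg.html - 403 FORBIDDEN
index.html - 200 OK
kontakt.html - 403 FORBIDDEN"""

for path in forbidden_paths(sample):
    print(path)
```

If every page that is missing from the sitemap shows up in this list, the generator itself is fine and something on the server is rejecting the requests.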

Then I remembered that I had installed the Joomla! component sh404sef, which also has security settings that can prevent flooding (too many requests from one IP in a short time).

Quick test:
When I deactivated the security features in sh404sef, the sitemap was generated flawlessly.
Because I hate to give up extra security, I searched for a setting that would allow me to keep the security and still run the sitemap tool. I found two possible solutions.

Solution 1: enter your own IP in the white list

In the sh404sef control center (control-center > settings > security), look at the setting "IP White List" and enter the IP address of your website. A general setting like localhost or the domain name will not work, which is also the drawback of this method. If you have your own IP (most likely if you have SSL encryption running), that is not a problem. But if your project is hosted with one of the big shared hosters, with thousands of web spaces on one machine that all share your IP, you may want to consider the second solution.

Scenario: the neighboring web space has been hacked (and logically has the same IP as your web space) and starts attacking your website. Those requests would not be stopped, because that IP is on the white list. OK, several bad things have to happen first, but how did that nice quote go?

"The question is not if you are paranoid, the question is, are you paranoid enough?"

Solution 2: split the crawl into several chunks

The sitemap generator lets you break the indexing down into smaller chunks and set a delay between the segments. That way you can configure it so that it does not trigger the flooding criteria. This may be advisable anyway with slower hosters or very large websites. The drawback is that the generation process takes longer, so you may want to run it as a cron job.
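
To get a starting value for the delay, a rough calculation helps. The sketch below assumes the flood protection counts requests per IP in a sliding time window, and that each chunk is fetched as one quick burst. The limit of 10 requests per 10 seconds is a made-up example, not a sh404sef default, and the function name is hypothetical. Read the real values from your own security settings.

```python
def min_delay_seconds(chunk_size, max_requests, window_seconds, margin=1.0):
    """Pause between chunks that keeps every sliding window of
    `window_seconds` at or below `max_requests` requests.

    Assumption: each chunk is fetched as one burst from a single IP.
    """
    chunks_per_window = max_requests // chunk_size
    if chunks_per_window < 1:
        raise ValueError("chunk_size exceeds the flood limit; use smaller chunks")
    # Chunk starts must be more than window / chunks_per_window apart,
    # so a small safety margin is added on top.
    return window_seconds / chunks_per_window + margin


# Hypothetical limit: 10 requests per 10 seconds, crawling 5 URLs per chunk.
print(min_delay_seconds(chunk_size=5, max_requests=10, window_seconds=10))
```

Treat the result as a lower bound to start from, then verify it with a real generation run, since the actual counting method of the flood protection may differ from this assumption.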


Generally speaking

Always test after changing a single parameter. Otherwise you will not know which setting broke the configuration. Move the parameters step by step from open to more and more restricted, until the sitemap.xml is just barely (but securely) generated.