A few weeks ago I was working on a sitemap implementation for a website still running Sitecore 8.2, and I decided to use one of the modules available on the Sitecore Marketplace. I chose the Sitemap XML module because I was already familiar with it and had verified that it was compatible with this version of Sitecore.
An important requirement was the ability to exclude from the generated sitemap file all pages that are not accessible to anonymous users. It doesn’t make sense to have a search engine crawl restricted pages if their content is secured and accessible only to authenticated users with specific roles. The crawler would only ever see the login page or a “no access” page that the web application redirects it to.
The problem
I reviewed the module documentation and it seemed that the Sitemap XML module supported this feature out of the box. I installed it, tested it and… sigh! It didn’t go as expected. Restricted pages were still listed in the generated sitemap file. It was time to dig into the module code and troubleshoot.
The module uses the Sitecore.Security.Accounts.UserSwitcher class to switch the current context user to the website’s anonymous user before querying the items to include in the sitemap. Sitemap generation is triggered when a publishing event ends, and it runs in the Sitecore shell context under the logged-in Sitecore user who started the publish. In this context, item security restrictions were completely ignored, and every item appeared to be accessible to the anonymous user.
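For readers who have not used it, the user-switching pattern described above looks roughly like the sketch below. This is not the module’s actual code: the class and method names are hypothetical, and it assumes that GetItem returns null when the switched context user has no read access to the item, which is the behavior I expected to see.

```csharp
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.Security.Accounts;

// Hypothetical helper, for illustration only.
public static class AnonymousAccessSketch
{
    // Returns true when the item can be loaded while the context user
    // is temporarily switched to the anonymous extranet user.
    public static bool IsReadableByAnonymous(Database database, string itemPath)
    {
        User anonymous = User.FromName(@"extranet\Anonymous", false);

        // The context user is restored when the switcher is disposed.
        using (new UserSwitcher(anonymous))
        {
            Item item = database.GetItem(itemPath);
            return item != null;
        }
    }
}
```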
The fix
I knew that UserSwitcher worked fine in the website context, because it was already used in one of the website components, so I decided to rely on the website context to check whether the anonymous user had access to a page item when generating the sitemap. I implemented a simple HTTP handler that performs the access check for a single item and returns the result, True or False, as plain text in the response.
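using System.Web;
using Sitecore.Configuration;
using Sitecore.Data.Items;
using Sitecore.Security.Accounts;

// Writes "True" or "False" as plain text, depending on whether the anonymous
// extranet user can read the item passed in the "item" query string parameter.
public class SitemapUserAccessHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/plain";

        var itemPath = context.Request.QueryString["item"];
        if (string.IsNullOrEmpty(itemPath))
        {
            // No item requested: report no access instead of an empty response.
            context.Response.Write(bool.FalseString);
            return;
        }

        // Run the check in the "website" site context, not the shell context.
        Sitecore.Context.Site = Factory.GetSite("website");
        User anonymous = User.FromName(@"extranet\Anonymous", false);

        Item item = Sitecore.Context.Site.Database.GetItem(itemPath);

        // A missing item is treated as not accessible.
        context.Response.Write(item != null && item.Security.CanRead(anonymous));
    }

    public bool IsReusable
    {
        get { return false; }
    }
}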
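On the sitemap generation side, each candidate item then needs one request to this handler. The following is a minimal sketch of such a caller, not the code that ships with the module: the class name, the handler path /SitemapUserAccess.ashx and the base URL are assumptions made for illustration, and the handler must be registered at whatever path you actually use.

```csharp
using System;
using System.Net;

// Hypothetical client for the access-check handler.
public static class SitemapAccessClient
{
    // Assumed handler path; change it to match your own registration.
    private const string HandlerPath = "/SitemapUserAccess.ashx";

    // Builds the handler URL, escaping the item path for the query string.
    public static string BuildUrl(string siteBaseUrl, string itemPath)
    {
        return siteBaseUrl.TrimEnd('/') + HandlerPath
            + "?item=" + Uri.EscapeDataString(itemPath);
    }

    // Anything other than a parsable "True" is treated as no access.
    public static bool ParseResponse(string body)
    {
        bool canRead;
        return bool.TryParse((body ?? string.Empty).Trim(), out canRead) && canRead;
    }

    // Performs one HTTP GET per item and interprets the plain-text answer.
    public static bool CanAnonymousRead(string siteBaseUrl, string itemPath)
    {
        using (var client = new WebClient())
        {
            string body = client.DownloadString(BuildUrl(siteBaseUrl, itemPath));
            return ParseResponse(body);
        }
    }
}
```

Treating every unexpected response as “no access” keeps a page out of the sitemap when the check fails, which seemed the safer default for restricted content.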
The performance improvement
The sitemap generation process in the Sitemap XML module is triggered synchronously at the end of the normal Sitecore publishing process, which increases the time it takes to publish an item. The HTTP handler adds one request round trip per processed item to the total sitemap generation time, so it needed to be fast to avoid degrading the publishing experience.
The HTTP handler took only 3 milliseconds to execute a request, which seemed pretty fast. Well, it’s all relative! For a website with fewer than 1,000 pages, it added no more than 3 seconds (1,000 × 3 ms) to the entire process. But for a website with about 5,000 pages, it added about 15 seconds, negatively impacting the normal publishing experience in Sitecore.
If a website has a scaled architecture with separate content management and content delivery environments, like the one I was working on, publishing can be completely decoupled from sitemap generation by configuring the module to trigger the sitemap generation process only when the remote publishing event ends:
<events timingLevel="custom">
  <event name="publish:end:remote">
    <handler type="Sitecore.Modules.SitemapXML.SitemapHandler, Sitemap.XML" method="RefreshSitemap" />
  </event>
</events>
In this way, the publishing event that triggers the sitemap generation process occurs on the content management environment, while the actual sitemap generation process occurs on the content delivery environment only, with zero impact on the user experience in Sitecore.
Conclusions
In this blog post I described a solution to exclude restricted pages from a generated sitemap when using the Sitemap XML module. This solution is best suited to scaled Sitecore websites where the sitemap generation process can run on content delivery environments only. If you have any questions, or want to share a different implementation approach, please don’t hesitate to comment on this post.
Thank you for reading!