Recently, on an application using SignalR, we found that we were getting many
System.InvalidOperationException: The ConnectionId is in the incorrect format exceptions (sometimes 10 per second) from the SignalR hubs in the site.
While there is plenty of discussion of the causes of this error in the SignalR documentation and on Q&A sites, it mostly concerns changing levels of authentication (SignalR treats a connection as new when the authentication level, or the authenticated user, changes). That wasn't relevant here: the sheer volume of exceptions made it an unlikely cause, and in any case the site was already written to cope with these changes on login.
The cause turned out to be search engine crawler services.
The culprit – search engine crawlers
On inspecting the logs from the site, and the logged exception/context information, it was clear that the site was being visited by the crawlers for the various search engines, and that these were not interacting well with SignalR, in two ways:
- The crawler sends the wrong content to the SignalR endpoint. This is the root cause of the ConnectionId error.
- The crawler doesn’t correctly handle the polling mechanisms used by SignalR, in particular long polling, and instead sits continuously calling the endpoint, generating large volumes of requests and the related errors.
On this particular site, the result was severe performance problems, given the small scale of the site and web server and the volume of requests coming in.
The fix: robots.txt
Happily the solution is simple enough: robots.txt, the text file that can be placed in a site to specify the required behaviour of visiting web crawlers. There is plenty of documentation online if you're not familiar with it.
The fix then is to add a robots.txt file to the site:
- Create a robots.txt file in the root of the web application
- Add content to the file based on the options below
- Check that the robots.txt file is visible by visiting the site root URL plus “robots.txt” (e.g. http://jonegerton.com/robots.txt)
- If your site uses authentication, make sure that robots.txt can be reached by a non-authenticated session
The content used in robots.txt is straightforward:
Allow indexing except for SignalR
Adding the following to robots.txt tells crawlers to ignore SignalR but index the rest of the site:
User-agent: *
Disallow: /signalr/
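If you want to sanity-check the rules before deploying, Python's standard-library robots.txt parser can evaluate them (a quick sketch; the request paths are just examples):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from above: block SignalR, allow everything else.
rules = """User-agent: *
Disallow: /signalr/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The SignalR endpoints are off-limits to compliant crawlers...
print(parser.can_fetch("*", "/signalr/connect"))  # False
# ...but ordinary pages are still crawlable.
print(parser.can_fetch("*", "/about"))            # True
```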
If you don’t want your site to be crawled at all by search engines (for example, a member-login type site), you can disallow everything:

User-agent: *
Disallow: /
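The same kind of check confirms that this rule blocks every path, SignalR included:

```python
from urllib.robotparser import RobotFileParser

# The blanket rule: no crawling anywhere on the site.
rules = """User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Compliant crawlers should stay away from all paths.
print(parser.can_fetch("*", "/"))                 # False
print(parser.can_fetch("*", "/signalr/connect"))  # False
```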
While crawlers don’t have to respect the contents of robots.txt (it’s advisory rather than enforced), the main search engines will do so, and you should see the exceptions stop occurring (and site performance improve, if the load from the crawlers was having any impact).