I have had mixed feelings about "The Summer of Analytics". On the one hand, guys like Eric Tulsky and Tyler Dellow have done amazing work when it comes to hockey analytics and have truly earned their jobs with their respective NHL teams. On the other hand, the analytics community lost some extremely intelligent and insightful minds. Today (August 13, 2014) the analytics community "lost" another resource thanks to an adjustment to the nhl.com Terms of Service (TOS).
The portion of the TOS I am speaking of is the second bullet point under "Section 2: Prohibited Content and Activities" and it reads:
You may not access or use, or attempt to access or use, the Services to take any action that could harm us or any other person or entity (each a "person"), interfere with the operation of the Services, or use the Services in a manner that violates any laws. For example, you may not:
- Impersonate any person or falsely state or otherwise misrepresent your credentials, affiliation with any person, or the origin of any information you provide;
- Engage in unauthorized spidering, scraping, or harvesting of content or information, or use any other unauthorized automated means to compile information;
This little addition to the TOS has a major impact on hockey analytics as a whole. It directly affects how we've gathered information for the past 8 years.
What is Scraping?
Before I go any further I wanna cover off this little bit of information. If you don't have a programming background or are new to the analytics community, you may not fully grasp what is being prohibited here, so I'll explain it as best I can.
Almost every NHL game going back to the 2007/08 season has game sheets available on the nhl.com website. The list includes Play by Play, Shots, Roster, Summary and many more. These are referred to by the NHL as "real-time scoring statistics" (RTSS) and are the backbone of the analytics movement, in particular the Play by Play page.
The Play by Play page (see an example) breaks down "every" trackable event in a game. Everything from a goal to a missed shot, from a hit to a faceoff win or loss, is tracked on the Play by Play page. More importantly, though, it also includes who was on the ice for each event, which is how Corsi and Fenwick are tracked per player. A myriad of other stats are also tracked using these Play by Play pages, and they are the building blocks of hockey analytics.
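To make the "who was on the ice" part a bit more concrete, here is a minimal sketch of how a per-player Corsi and Fenwick differential could be tallied once the events are in hand. Corsi counts all shot attempts (goals, shots on goal, missed shots and blocked shots) for minus against while a player is on the ice, and Fenwick is the same thing without blocked shots. The event layout, the event type labels and the sample data are all made up for illustration and are not the NHL's actual format. In particular, I'm assuming `team` is always the team that attempted the shot, including on blocked shots.

```python
from collections import defaultdict

# Event types counted as shot attempts (Corsi); Fenwick drops blocked shots.
# These labels are assumptions for this sketch, not an official list.
CORSI_EVENTS = {"GOAL", "SHOT", "MISS", "BLOCK"}
FENWICK_EVENTS = CORSI_EVENTS - {"BLOCK"}


def shot_attempt_differentials(events):
    """Return {player: {"corsi": int, "fenwick": int}} as for-minus-against."""
    totals = defaultdict(lambda: {"corsi": 0, "fenwick": 0})
    for ev in events:
        if ev["type"] not in CORSI_EVENTS:
            continue  # hits, faceoffs, etc. are not shot attempts
        for team, players in ev["on_ice"].items():
            # +1 for skaters on the shooting team, -1 for the other side.
            sign = 1 if team == ev["team"] else -1
            for player in players:
                totals[player]["corsi"] += sign
                if ev["type"] in FENWICK_EVENTS:
                    totals[player]["fenwick"] += sign
    return dict(totals)


# Hypothetical events with placeholder player names.
sample_events = [
    {"type": "SHOT", "team": "EDM",
     "on_ice": {"EDM": ["A", "B"], "CGY": ["X", "Y"]}},
    {"type": "BLOCK", "team": "CGY",
     "on_ice": {"EDM": ["A", "B"], "CGY": ["X", "Y"]}},
    {"type": "HIT", "team": "EDM",
     "on_ice": {"EDM": ["A", "B"], "CGY": ["X", "Y"]}},
    {"type": "MISS", "team": "EDM",
     "on_ice": {"EDM": ["A", "C"], "CGY": ["X", "Z"]}},
]

differentials = shot_attempt_differentials(sample_events)
for player in sorted(differentials):
    print(player, differentials[player])
```

The real work in practice is getting clean on-ice data for every event. Once you have that, the counting itself is about this simple.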
Going back in time to the early days of the analytics movement, guys like Vic Ferrari and Gabriel Desjardins came up with simple tools to parse through all of the raw HTML, compile it and put it into an easy to read format. This method of pulling the data from the website is also known as scraping.
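For anyone who has never seen it done, the parsing half of scraping looks roughly like the sketch below. It only reads a small, made-up HTML table stored in a string. It does not fetch anything from nhl.com, and the markup and column names are my own invention rather than the real game sheet layout. I am not claiming this is how Vic's or Gabe's tools worked. It just shows the general idea of turning raw HTML rows into something easy to work with.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows use <th>, so they end up empty
                self.rows.append(self._row)
            self._row = None


# Made-up markup for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>#</th><th>Period</th><th>Time</th><th>Event</th><th>Description</th></tr>
  <tr><td>1</td><td>1</td><td>0:00</td><td>FAC</td><td>EDM won faceoff</td></tr>
  <tr><td>2</td><td>1</td><td>0:34</td><td>SHOT</td><td>EDM wrist shot</td></tr>
</table>
"""


def parse_events(html):
    """Turn each data row into a dict keyed by hypothetical column names."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    keys = ("number", "period", "time", "type", "description")
    return [dict(zip(keys, row)) for row in parser.rows]


for event in parse_events(SAMPLE_HTML):
    print(event)
```

I stuck with the `html.parser` module from Python's standard library so there is nothing to install. Any HTML parsing library would do the same job.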
Fast forward to the present day and sites like extraskater.com, hockey-reference.com and stats.hockeyanalysis.com have taken those same concepts and expanded on them or given us a nicer view of the data. All of them use some variation of the scraping method to obtain some, if not all, of their data. Other smaller projects also use some variation of scraping to pull their data from the nhl.com website.
What is the Issue with Scraping?
Scraping is used all the time in programming. In fact you've probably used other non-hockey sites that implement scraping. Maybe you've used one to book a flight or a hotel or a car rental. The difference is that those sites have agreements with the service provider to scrape the data and have paid big dollars to do so.
The problem with unauthorized scraping is that the information being scraped usually has some sort of copyright associated with it. The copyright holder has the right to their content and their data so they can do what they please with it. There have also been court cases won/lost (depending on which side they were on) over a TOS that didn't include this little bullet point. That is at the core of this change to TOS, the NHL wants to protect its copyright, but in the NHL's case there may be another underlying issue.
When data is scraped from the nhl.com website and put onto Extra Skater or Behind the Net, it takes away viewers and clicks from nhl.com. This impacts advertising revenue, as small as it may be, which in turn impacts overall revenue. And we all know how much the NHL (or any organization for that matter) hates to lose revenue (see 1994, 2005 and 2012). That is why they needed to adjust this in the TOS.
What is the Impact?
It is a bit early at this point to know what the direct impacts are going to be for certain. For all we know, some of these sites may have already worked out deals with the NHL in anticipation of these changes. But if they haven't then there may be changes to some of our favourite sites in the near future.
There is also the potential that over the past few seasons the NHL has taken notice that "advanced stats" have started to gain traction. The league may be looking at adding "advanced stats" to its site and may want to keep all of that information in house to sell in the future.
This is obviously a double edged sword much like the rest of "The Summer of Analytics" has been. It is a major step forward in general acceptance of these stats but as with any big organization advances and changes take a long time to implement so the detail may not be there.
What can we do?
The short answer is not a whole heck of a lot when it comes to the NHL's TOS. In the end the NHL has every right to do what it wants with its information. Chances are that the decision to change the TOS is a combination of what I mentioned above along with some other reasons we aren't privy to. So trying to get the NHL to change its TOS is a non-starter, but there are still things we can do.
Right now I know of one project but I would love to add to the list. If you know of any just add them into the comments and I will add them or better yet write a separate post to list the projects.
- Corey Sznajder is running his "All Three Zones" project that is tracking zone entries and exits for the entire 2013-14 season. Go make a donation to the project. Stuff like this only helps to broaden our understanding of the game.
It's tough to say what the future holds but I know that this is going to have an impact no matter how you look at it. The only thing I know for certain is that this change will just push the community to come up with innovations to track this data. There will be new ideas on how to track the data and better results will come of it.