Wednesday, August 16, 2017

A revision of HTML reports from the NHL website.

So, while revising and revisiting the data files, I did an extra scan of the HTML reports available on NHL.com.

The HTML reports are extracted through the following URL pattern: 

sprintf('http://www.nhl.com/scores/htmlreports/%d%d/%s%02d%04d.HTM', $season, $season+1, $type, $stage, $game_id) 

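where $season is the start year of the season, $type is the type of the report (ES for the event summary, GS for the game summary, PL for the play-by-play, RO for the roster), $stage is 2 for the regular season and 3 for the playoffs, and $game_id is the NHL game id within the season.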
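For readers who prefer Python, here is a minimal sketch of the same URL pattern. The function name `report_url` is my own invention for illustration and is not part of any released scraper; the "ES" report type in the usage line is taken from the table header below.

```python
def report_url(season: int, report_type: str, stage: int, game_id: int) -> str:
    """Build an HTML report URL following the sprintf pattern above.

    season      -- start year of the season, e.g. 1999 for 1999/2000
    report_type -- report code such as "ES", "GS", "PL" or "RO"
    stage       -- 2 for regular season, 3 for playoffs
    game_id     -- NHL game id within the season
    """
    return (
        "http://www.nhl.com/scores/htmlreports/"
        f"{season}{season + 1}/{report_type}{stage:02d}{game_id:04d}.HTM"
    )


# First row of the table below: season 1999, regular season, game 0029
print(report_url(1999, "ES", 2, 29))
```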

Below are the results for the problematic games, with the reports that are beyond salvage marked accordingly:

Season Stg GameId ES GS PL RO
1999 Reg 0029 M M N N
1999 Reg 0045 I I N N
1999 Reg 0050 I I N N
1999 Reg 0058 R R N N
1999 Reg 0071 I I N N
1999 Reg 0072 I I N N
1999 Reg 0081 I I N N
1999 Reg 0109 I I N N
1999 Reg 0130 I I N N
1999 Reg 0323 R R N N
1999 Reg 0619 I I N N
1999 Reg 0689 I I N N
1999 Reg 0690 B I N N
1999 Reg 0836 I I N N
1999 Reg 1034 I I N N
1999 P/O 0325 I I N N
2000 Reg 0029 R R N N
2000 Reg 0038 R R N N
2000 Reg 0039 I I N N
2000 Reg 0041 R R N N
2000 Reg 0042 R R N N
2000 Reg 0043 R R N N
2000 Reg 0044 R R N N
2000 Reg 0045 I I N N
2000 Reg 0049 R R N N
2000 Reg 0067 R R N N
2000 Reg 0072 I I N N
2000 Reg 0073 I I N N
2000 Reg 0077 I I N N
2000 Reg 0080 R R N N
2000 Reg 0083 R R N N
2000 Reg 0085 I I N N
2000 Reg 0095 R R N N
2000 Reg 0102 I I N N
2000 Reg 0112 R R N N
2000 Reg 0186 I I N N
2000 Reg 0187 I I N N
2000 Reg 0189 I I N N
2000 Reg 0920 R R N N
2000 Reg 0921 I I N N
2000 Reg 0924 R R N N
2000 Reg 0925 R R N N
2000 Reg 0926 R R N N
2000 Reg 0928 I I N N
2000 Reg 0983 B V N N
2000 Reg 1166 I I N N
2003 Reg 0191 V I V N
2003 Reg 1205 I I I N
2003 P/O 0134 V V B N
2005 Reg 0298 R V V N
2005 Reg 0458 B V V N
2005 Reg 0677 V V V B
2005 Reg 0679 V V V B
2005 Reg 0681 V V V B
2007 Reg 1178 I I B I
2008 Reg 0259 I I B I
2008 Reg 0409 I I B I
2008 Reg 1077 I I B I
2009 Reg 0081 I I B I
2009 Reg 0827 V I V V
2009 Reg 0836 V I V V
2009 Reg 0857 I I V V
2009 Reg 0863 I I V V
2009 Reg 0874 I I V I
2009 Reg 0885 I V V I
2010 Reg 0429 I I V V
2011 Reg 0259 I I V V
Legend:
I - incomplete (not through the end of the game)
B - broken (doesn't pass HTML parser)
M - misplaced (belongs to a different game that doesn't have a file associated with it)
R - replica (copy of a file from another game)
N - not available (file not available on NHL.com)
V - file is good.

Friday, June 30, 2017

I'm not gone

For those few people who may be reading this -

This blog ain't dead. It's just that I've been so busy with my regular job, with my website rework and with moving into the new house, that I haven't had the time to consider a proper post.

Stay tuned, though...

Sunday, May 28, 2017

Last features - and a summer freeze

During the month of May I added two new features to the website, naturally dedicated to the upcoming draft:

Draft Pick Stats
Draft Pick Success

Now it's time to step back, look around and improve the infrastructure behind the website, both in the matter of the collection and the publication of the data. Thanks to all the visitors both of this blog and of the website.

I will continue to publish my predictions for the remainder of the playoff games on Twitter - follow me @morehockeystats! You are also most welcome to send me your ideas for new unusual statistics.

So the projected plan for the summer months is:
June - improve and finally release the Perl scraper
July - improve the publishing mechanism for the website and speed it up
August - improve the Elo-based models behind the site's projections
September - add some new features

This blog is not on pause! I will continue to publish entries about my hockey-related thoughts as they come. Stay tuned!

Monday, May 1, 2017

On carrying Momentum...

One frequently hears talk about the importance of carrying momentum over an intermission. I thought it might be possible to measure this harmony with algebra, so I tried to do that. I chose to analyze a very specific question:

If the regulation of a game ends in a tie other than 0-0, how frequently does the team that tied the game with the last regulation goal win in overtime?

We define the team that tied the game as the one having the momentum, and the other team as the one trying to show resilience. To answer the question, we analyzed the outcomes of games from seasons 2007/08-2016/17 (including the ongoing playoffs). We discard the games that end in a shootout, because their outcome depends more on the skill of the shooters and goaltenders than on whatever momentum might have been accrued.

The results of the analysis are displayed in the table below, per season and per the time frame during which the last tying goal was scored: in the last two, five, or ten minutes, in the last period, or in one of the first two. Each game is counted in exactly one time frame, so the segment counts add up to the season total. Each cell shows the percentage of wins by the team having the momentum, followed by the number of games falling into that segment. We also display a separate column and a separate row for playoff games, although a finer granularity is not really possible there because of the sample size.

Season   2         5         10        20        1st/2nd   total      totalPO
2007     54.2/24   57.9/19   52.9/34   53.8/13   52.6/38   53.9/128   31.2/16
2008     43.5/23   48.1/27   45.2/31   53.8/13   40.0/40   44.8/134   25.0/16
2009     42.9/28   56.5/23   72.7/22   64.7/17   53.7/41   56.5/131   58.8/17
2010     48.6/37   54.2/24   47.1/34   40.7/27   56.8/44   50.0/166   59.1/22
2011     50.0/24   45.8/24   43.5/23   72.0/25   47.7/44   51.4/140   37.5/24
2012     62.5/16   33.3/15   50.0/22   50.0/14   57.9/19   51.2/86    53.8/26
2013     58.1/43   43.5/23   34.6/26   45.5/22   44.1/34   46.6/148   70.8/24
2014     51.7/29   65.2/23   55.3/38   46.7/15   60.5/43   56.8/148   57.9/19
2015     60.0/40   46.7/30   44.4/36   45.8/24   39.6/48   47.2/178   52.6/19
2016     43.6/39   50.0/28   60.5/38   48.1/27   61.8/68   54.5/200   63.2/19
totalPO  61.4/44   46.7/30   55.8/43   68.4/19   40.9/66   52.0/202   52.0/202
total    51.5/303  50.4/236  50.7/304  51.8/197  51.8/419  51.3/1459  52.0/202

We see that there is no specific "momentum" or "resilience" capability overall: there is practically no indication of how the OT will end based on which team scored the last game-tying goal (GTG). The only two moderate exceptions with decent sample sizes are in the second and the sixth columns of the penultimate row, that is, in the playoffs. The GTG-scoring team is 27-17 (61.4%) when it scored the tying goal in the last two minutes. However, if the GTG was scored before the last period, as happened in 66 games, the momentum apparently does not carry over two or more intermissions, and the tying team is 27-39 (40.9%) in these games.
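The segmentation behind the table can be sketched roughly as follows. This is only an illustration under my own assumptions: the function names are hypothetical, the goal time is assumed to be given in seconds elapsed since the opening faceoff, and a goal scored exactly on a boundary (say, with 2:00 left) is assumed to fall into the later segment.

```python
REGULATION_SECONDS = 60 * 60  # three 20-minute periods


def tying_goal_segment(goal_second: int) -> str:
    """Return the table column for the last game-tying goal of regulation."""
    remaining = REGULATION_SECONDS - goal_second
    if remaining <= 2 * 60:
        return "2"        # scored in the last two minutes
    if remaining <= 5 * 60:
        return "5"        # between five and two minutes left
    if remaining <= 10 * 60:
        return "10"       # between ten and five minutes left
    if remaining <= 20 * 60:
        return "20"       # earlier in the third period
    return "1st/2nd"      # first or second period


def momentum_win_pct(wins: int, games: int) -> float:
    """Percentage of OT wins by the team that scored the tying goal."""
    return round(100.0 * wins / games, 1)


# The two playoff figures quoted in the text: 27-17 and 27-39
print(momentum_win_pct(27, 44), momentum_win_pct(27, 66))
```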

Here is how it looks on a graph:
We can see all lines wobbling slightly above the 50 mark. Insufficiently above. Even if we observe the extra 1.3% chance overall (2.0% in playoffs) - wouldn't it be more related to the home/away advantage? I haven't looked at this aspect yet. Maybe another time.

The Real Life[TM] took a bit of a toll on this blog... But we resume, with resilience and hoping to generate momentum!

Thursday, March 30, 2017

On the NHL Scoring System - Part III

Part I
Part II


Once again, I am driven by the idea that if you want to encourage goal scoring, you need to reward goal scoring in the standings directly, not indirectly through winning. Then, based on an idea of a fellow hockey fan and blogger, a new suggestion was born in my mind.

Not so long ago I was involved in another discussion on the subject on Twitter, where an interesting alternative, 2-1-0-0 was described. The idea is that you still get two points for a win in regulation, just one point for a win in OT, but nothing if you lose, and, the key, both teams get nothing if the game is tied at the end of regulation (shootouts are abolished). This is a very sharp idea, but for me something felt very wrong, and then it crystallized:

It's not fair to reward a hard fought 5-5 tie with zero points, just like a lazy-skated 1-1. We still want to encourage goal scoring, and the simple 2-1-0-0 just unbalances the game. And so it dawned on me. We should reward goals with extra standings points!

The formula that first came to mind, and which seemed fair: give each goal 0.1 points in the standings, while the win-scoring system shall be 2-1-0-0. If you or your database have an aversion to decimals, assign 20 points for a regulation win, 10 points for an OT win, and 1 extra point for each goal scored. This will encourage goal scoring in any situation, and for both sides, including the games that go into garbage time pretty quickly. So, a 7-2 win will give the winner 2.7 points, and the loser 0.2 points. A 2-0 win will give the winner 2.2 points, the loser 0. A 4-3 OT win will give the winner 1.4 points, the loser 0.3 points. A 5-5 OT tie will give each side 0.5 points.
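To make the arithmetic concrete, here is a small Python sketch of the per-game award, using the decimal-free variant (tenths of a point). The function name and its signature are my own choice for illustration.

```python
def game_points_tenths(goals_for: int, goals_against: int, overtime: bool) -> int:
    """Standings award for one team in one game, in tenths of a point.

    20 for a regulation win, 10 for an overtime win, 0 for a loss or a tie,
    plus 1 for every goal the team scored.
    """
    if goals_for > goals_against:
        base = 10 if overtime else 20
    else:
        base = 0
    return base + goals_for


# The examples from the paragraph above
print(game_points_tenths(7, 2, False), game_points_tenths(2, 7, False))  # 7-2 win
print(game_points_tenths(4, 3, True), game_points_tenths(3, 4, True))    # 4-3 OT win
print(game_points_tenths(5, 5, True))                                    # 5-5 OT tie
```

Dividing the result by 10 gives the decimal values quoted above (2.7 and 0.2, 1.4 and 0.3, 0.5).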

Wait, there's a caveat.

Imagine a situation where a team needs just 0.1 point to pass another one in the standings for the playoff spot. They are playing an opponent whose number of points in the standings does not have any effect on them. In such a situation, the team would play without a goaltender at all, because they don't care how much they lose, they just need that goal. Now, this is not really hockey, so to prevent this kind of play a restriction needs to be introduced:

Any goal scored without a goaltender on the ice, when not on a delayed penalty, and when trailing by more than two goals shall not yield any standings points.
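As a sketch, the restriction could be expressed as a simple predicate. The names below are hypothetical, and I am assuming that "trailing by more than two goals" refers to the deficit at the moment just before the goal is scored.

```python
def goal_earns_points(goalie_on_ice: bool,
                      delayed_penalty: bool,
                      trailing_by: int) -> bool:
    """Decide whether a goal yields its 0.1 standings points.

    goalie_on_ice   -- the scoring team had its goaltender on the ice
    delayed_penalty -- the goaltender was pulled for a delayed penalty
    trailing_by     -- scoring team's deficit before the goal (0 if not trailing)
    """
    pulled_by_choice = not goalie_on_ice and not delayed_penalty
    return not (pulled_by_choice and trailing_by > 2)


print(goal_earns_points(False, False, 3))  # empty net, down by three
print(goal_earns_points(False, False, 1))  # ordinary late-game goalie pull
```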

Here is an example of what today's standings would look like under the suggested system:

Team                           W  OW T  L  GF  GA  P
Boston Bruins                  34 04 04 34 216 201 93.6
Montreal Canadiens             31 09 05 31 205 186 91.5
Ottawa Senators                32 04 08 31 191 191 87.1
--------------------------------------------------------
Washington Capitals            41 08 07 20 246 165 114.6
Columbus Blue Jackets          38 09 04 24 233 170 108.3
Pittsburgh Penguins            37 06 08 25 256 211 105.6
--------------------------------------------------------
New York Rangers               38 05 06 28 242 203 105.2
Toronto Maple Leafs            29 06 09 31 229 213 86.9
--------------------------------------------------------
New York Islanders             28 05 06 36 217 224 82.7
Tampa Bay Lightning            27 06 07 35 206 207 80.6
Carolina Hurricanes            28 04 07 36 198 208 79.8
Buffalo Sabres                 24 06 08 39 191 215 73.1
Philadelphia Flyers            22 07 11 36 193 218 70.3
Florida Panthers               21 07 11 37 192 210 68.2
New Jersey Devils              18 06 06 46 171 221 59.1
Detroit Red Wings              16 07 08 45 181 224 57.1
--------------------------------------------------------
Chicago Blackhawks             36 09 05 27 230 197 104.0
Minnesota Wild                 37 04 05 30 241 193 102.1
St. Louis Blues                35 06 02 33 213 200 97.3
--------------------------------------------------------
San Jose Sharks                35 06 03 32 204 185 96.4
Anaheim Ducks                  37 02 06 31 200 183 96.0
Edmonton Oilers                33 05 09 29 221 191 93.1
--------------------------------------------------------
Nashville Predators            33 04 06 33 224 206 92.4
Calgary Flames                 30 09 06 32 208 206 89.8
--------------------------------------------------------
Winnipeg Jets                  29 03 04 41 226 243 83.6
Dallas Stars                   27 04 02 43 207 240 78.7
Los Angeles Kings              23 11 06 36 183 185 75.3
Vancouver Canucks              19 07 06 44 169 221 61.9
Arizona Coyotes                17 04 08 48 176 245 55.6
Colorado Avalanche             14 06 01 55 150 257 49.0

Naturally, they would not be the same standings if the system were indeed implemented, but why not take a look? And once again, try it in the AHL first, it won't hurt anyone.
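For anyone who wants to re-derive the P column of the table above, a quick sketch: two points per regulation win (W), one per overtime win (OW), and a tenth per goal scored (GF). The three sample rows are copied from the table; the function name is mine.

```python
def season_points(wins: int, ot_wins: int, goals_for: int) -> float:
    """Season total under the suggested system, from the W, OW and GF columns."""
    tenths = 20 * wins + 10 * ot_wins + goals_for  # integer math avoids rounding noise
    return tenths / 10


# (team, W, OW, GF) taken from the standings table above
sample_rows = [
    ("Boston Bruins", 34, 4, 216),
    ("Washington Capitals", 41, 8, 246),
    ("Colorado Avalanche", 14, 6, 150),
]
for team, w, ow, gf in sample_rows:
    print(team, season_points(w, ow, gf))
```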

Monday, March 13, 2017

On Buchholz and Sonneborn-Berger coefficients - Part II

Part I

2. The Sonneborn-Berger coefficient.
This stranger beast is a metric extensively used for tie-breaks in chess round-robins and as an auxiliary tie-break tool to the Buchholz coefficient in non-round-robin events. Let's start with the definition.

$$SB = \sum_{n=1}^{N} f(R_n, P_n)$$

where Rn is the result against the n-th opponent, and Pn is the opponent's points score.
The function  f(Rn, Pn) is defined as:

f(Win, Pn)  = Pn
f(Tie, Pn)  = Pn/2
f(Loss, Pn) = 0

The resulting value evaluates whether the participant performed better against stronger or weaker opposition. Actually, I do have a problem with this criterion as a tie-breaker: in my opinion ALL points are created equal, and it doesn't matter if they came from a contender or a bottom feeder. However, this metric does answer the notorious statements like "This team only shows up for big games" and "This team is only good against garbage opposition."

So, first of all, for the NHL application, we will modify the function f(Rn, Pn) to:

f(Win, Pn) = Pn
f(OW, Pn)  = 2*Pn/3
f(OL, Pn)  = Pn/3
f(L, Pn)   = 0

to account for the overtime point.
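In code, the modified function and the resulting sum look roughly like this. It is a minimal sketch: the result labels and the function name are my own, and the opponent values are assumed to be whatever score (points, or points per game) is being used for P.

```python
# Share of the opponent's score credited for each result, as defined above
SB_WEIGHTS = {"W": 1.0, "OW": 2.0 / 3.0, "OL": 1.0 / 3.0, "L": 0.0}


def sonneborn_berger(games):
    """Sum f(R_n, P_n) over a list of (result, opponent_score) pairs."""
    return sum(SB_WEIGHTS[result] * score for result, score in games)


# Made-up example: one game of each kind
example = [("W", 1.2), ("OW", 0.9), ("OL", 1.5), ("L", 1.0)]
print(sonneborn_berger(example))  # 1.2 + 0.6 + 0.5 + 0, up to float rounding
```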

Then, we can calculate the minimal possible SBmin value for a team with the given schedule so far this season, by assigning Wins to be against the weakest teams played, and the OW/OL against the weakest of the remainder, until the sum of W, OW and OL points adds up to the number of points the team currently has.

Similarly, we shall calculate the maximal possible SBmax value by assigning Wins to be against the strongest teams played, and the OW/OL against the strongest of the remainder, assuming OT wins are about 1/4 of the whole.

Then, depending on whether the actual SB is closer to SBmin or to SBmax, we may be able to say whether the team is more successful against the bottom feeders or against the top guns, or whether it collects its points from the whole spectrum available.
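Here is a simplified sketch of the bounds. It is not the exact procedure used for the table below: I assume the counts of regulation wins, OT wins and OT losses are known, rather than estimating the OT-win share, and the function name is hypothetical. With fixed counts, pairing the largest weights with the weakest opponents gives the minimum, and pairing them with the strongest gives the maximum.

```python
def sb_bounds(opponent_scores, wins, ot_wins, ot_losses):
    """Return (SBmin, SBmax) for a schedule and a fixed mix of results.

    opponent_scores -- one score (e.g. points per game) per game played
    """
    decided = wins + ot_wins + ot_losses
    if decided > len(opponent_scores):
        raise ValueError("more results than games played")
    # Weights in descending order: W, then OW, then OL, then regulation losses
    weights = [1.0] * wins + [2.0 / 3.0] * ot_wins + [1.0 / 3.0] * ot_losses
    weights += [0.0] * (len(opponent_scores) - decided)
    ascending = sorted(opponent_scores)
    sb_min = sum(w * s for w, s in zip(weights, ascending))
    sb_max = sum(w * s for w, s in zip(weights, reversed(ascending)))
    return sb_min, sb_max


# Made-up four-game schedule: one W, one OW, one OL, one L
print(sb_bounds([1.0, 1.4, 0.6, 1.2], wins=1, ot_wins=1, ot_losses=1))
```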

Here is the table describing how this season's teams have their SB positioned between SBmin and SBmax.

Team                   PPG   SBmin  SBopt  SB     SBmax
Pittsburgh Penguins    1.40  44.28  46.48  46.24  53.06
Washington Capitals    1.40  44.70  46.74  47.77  52.89
Minnesota Wild         1.37  42.25  44.36  46.63  50.66
Columbus Blue Jackets  1.37  43.10  45.36  46.44  52.15
Chicago Blackhawks     1.34  41.61  43.90  43.79  50.80
San Jose Sharks        1.31  40.68  42.97  44.16  49.84
New York Rangers       1.30  41.25  43.67  45.55  50.92
Ottawa Senators        1.25  37.84  40.07  41.79  46.78
Montreal Canadiens     1.25  39.37  41.74  41.05  48.87
Anaheim Ducks          1.19  36.86  39.43  40.12  47.15
Calgary Flames         1.18  35.97  38.49  38.20  46.05
Edmonton Oilers        1.16  35.86  38.32  37.43  45.70
Boston Bruins          1.15  34.73  37.23  37.74  44.72
Nashville Predators    1.13  33.28  36.14  38.04  44.72
Toronto Maple Leafs    1.13  34.64  36.99  35.66  44.02
St. Louis Blues        1.12  34.69  37.14  38.52  44.50
New York Islanders     1.12  34.36  36.94  37.94  44.71
Tampa Bay Lightning    1.09  32.62  34.98  35.41  42.06
Los Angeles Kings      1.07  32.10  34.66  33.56  42.34
Philadelphia Flyers    1.04  31.26  33.56  32.01  40.48
Florida Panthers       1.03  30.89  33.12  30.95  39.82
Carolina Hurricanes    1.00  29.43  31.78  32.41  38.85
Buffalo Sabres         0.99  30.09  32.49  33.43  39.68
Winnipeg Jets          0.96  27.55  30.35  31.48  38.75
Vancouver Canucks      0.96  28.48  30.91  29.02  38.21
Dallas Stars           0.94  28.05  30.62  31.16  38.34
Detroit Red Wings      0.94  29.12  31.12  30.02  37.13
New Jersey Devils      0.91  27.78  30.15  28.63  37.27
Arizona Coyotes        0.84  25.13  27.24  25.86  33.56
Colorado Avalanche     0.61  17.90  19.74  19.98  25.25

Once again, we use Points Per Game values because the teams and their opponents have played a different number of games at most moments within a season.

We would dare to take one more step forward and claim that a team that performs closer to SBmax seems to have a coach problem (notable differences highlighted in green in the table above). The roster is there to compete against the best, but the points aren't trickling in at a good enough pace against the fodder. Similarly, a team whose SB value is closer to SBmin is more likely to have a GM problem (notable differences highlighted in blue in the table above): its roster is not good enough to compete, but the coach is able to squeeze close to the maximum out of it. However, it is natural to win more games against the weaker teams, so we set the balance point at SBopt = (SBmax + 3*SBmin) / 4.

Wrapping up the talk about the Buchholz and the Sonneborn-Berger coefficients, we would like to state that these values are almost entirely descriptive and have no predictive capability, with the small exception of the Buchholz-based remaining schedule strength metric. And even then, it's sort of a 'descriptive prediction'.
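The balance point is easy to check against the table. Below is a small sketch using the Pittsburgh and Minnesota rows; the function names are mine, and since the published values are rounded, only approximate agreement should be expected.

```python
def sb_opt(sb_min: float, sb_max: float) -> float:
    """Balance point weighted three to one toward SBmin."""
    return (sb_max + 3 * sb_min) / 4


def sb_offset(sb: float, sb_min: float, sb_max: float) -> float:
    """Positive when the actual SB sits above the balance point, negative below."""
    return sb - sb_opt(sb_min, sb_max)


# Pittsburgh: SBmin 44.28, SB 46.24, SBmax 53.06 (table lists SBopt as 46.48)
print(sb_opt(44.28, 53.06), sb_offset(46.24, 44.28, 53.06))
# Minnesota: SBmin 42.25, SB 46.63, SBmax 50.66 (table lists SBopt as 44.36)
print(sb_opt(42.25, 50.66), sb_offset(46.63, 42.25, 50.66))
```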

Please see more Buchholz and Sonneborn-Berger data on the website!

Sunday, March 12, 2017

On Buchholz and Sonneborn-Berger coefficients.


The practice of chess tournaments provides two traditional metrics that are used to rank participants beyond their mere scoring. Their names are the Buchholz coefficient and the Sonneborn-Berger coefficient (often called just Berger). They are frequently used as tie-breakers in chess events; however, I arrived at a completely different application for them in National Hockey League seasons.

1. The Buchholz coefficient

The Buchholz coefficient is simply the sum of the points of your opponents.

$$B = \sum_{n=1}^{N} P_n$$

So, if you played five games, and your opponents currently have 5, 3, 8, 6 and 6 points, your Buchholz value will be 28. Please note, that the current number of points is always used, not the number of points at the moment of meeting. The outcome of the game does not matter (for that one see the Sonneborn-Berger).

At first, the usefulness of such a criterion would prompt a raise of the eyebrow. Indeed, it's not used in round-robin all-play-all tournaments as a final tie-break, because, naturally, the coefficient would be the same for all tied parties. It's used in a special format of chess events called the Swiss Tournament, not very popular outside of the realm of board games for purely logistic reasons. But then, consider, first, an NFL season. The list of opponents every team plays there over the 16-game season may be quite different. And whoever ends up with a larger Buchholz coefficient clearly would've had stronger opposition on the way.

Now let's go back to hockey. First of all, at the end of the season, although everyone has played everyone, they did so a different number of times. Thus, the sum of opponents' points at the end of the season could be different between teams - including within the same division, if they had a different schedule. So, this could still be a very valid tiebreak. Secondly, the season is so long (82 games, unlike a chess Swiss which is rarely longer than 11 rounds) that it gives us a lot of midway points in time, when the all-play-all has not been completed yet! Here the Buchholz coefficient can clearly show who has had the stronger opposition up until a certain moment.

Then, if we look at the remainder of the schedule for each team, and for every game we add the opponent's points we get an excellent remaining schedule strength estimator.

Wait... there's a caveat.

Unlike in a chess tournament, where every round occurs for everyone at the same time and, barring very rare circumstances, every participant has played an equal number of games at any point of the tournament, there may be a significant difference in the number of games played by different NHL teams, so summing the opponents' points up will not work very well. And these opponents also played a different number of games, so their total number of points is not a very good indicator.

Fortunately, it's not a big deal. Instead of totals, let's operate with per-game numbers. So the NHL Buchholz Coefficient for a team after N games becomes:

$$B = \frac{\sum_{n=1}^{N} PPG_n}{N}$$

The same applies to the remaining schedule strength, where the per-game numbers of the remaining opposition are summed and averaged.

So, if the team played three games against opponents who currently are:
A) 6 points in 4 games, B) 3 points in 3 games, C) 2 points in 5 games, then the team's Buchholz value would be (6/4 + 3/3 + 2/5) / 3 = 2.9/3 ~ 0.967pts.
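The per-game version is a one-liner in code. Here is a minimal Python sketch with a function name of my own choosing; the same function would serve for the remaining schedule strength if it is fed the list of remaining opponents instead of those already played.

```python
def nhl_buchholz(opponents):
    """Average points-per-game of the opponents, one entry per game.

    opponents -- list of (points, games_played) pairs, current values
    """
    if not opponents:
        raise ValueError("no games to average")
    return sum(points / games for points, games in opponents) / len(opponents)


# The three-game example above: A) 6 in 4, B) 3 in 3, C) 2 in 5
print(round(nhl_buchholz([(6, 4), (3, 3), (2, 5)]), 3))
```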

Here are the current (Mar 12th 2017) Buchholz coefficients and remaining schedule strengths for all 30 teams (and note how the Blues stand out with plenty of matchups vs Colorado and Arizona remaining).

+-----------------------+-----------+-------+-------+
| Team Name             | PPG       | Buch  | RStr  |
+-----------------------+-----------+-------+-------+
| Washington Capitals   | 1.4179105 | 1.119 | 1.133 |
| Pittsburgh Penguins   | 1.4029851 | 1.117 | 1.127 |
| Minnesota Wild        | 1.3939394 | 1.090 | 1.070 |
| Columbus Blue Jackets | 1.3731343 | 1.125 | 1.132 |
| Chicago Blackhawks    | 1.3283582 | 1.088 | 1.096 |
| San Jose Sharks       | 1.2985075 | 1.106 | 1.106 |
| New York Rangers      | 1.2941176 | 1.120 | 1.184 |
| Ottawa Senators       | 1.2537313 | 1.105 | 1.169 |
| Montreal Canadiens    | 1.2352941 | 1.122 | 1.097 |
| Edmonton Oilers       | 1.1791044 | 1.121 | 1.040 |
| Anaheim Ducks         | 1.1764706 | 1.102 | 1.150 |
| Calgary Flames        | 1.1764706 | 1.099 | 1.140 |
| Boston Bruins         | 1.1470588 | 1.115 | 1.151 |
| Toronto Maple Leafs   | 1.1343284 | 1.114 | 1.150 |
| Nashville Predators   | 1.1323529 | 1.105 | 1.116 |
| St. Louis Blues       | 1.1194030 | 1.144 | 0.943 |
| New York Islanders    | 1.1194030 | 1.142 | 1.103 |
| Tampa Bay Lightning   | 1.0895522 | 1.121 | 1.134 |
| Los Angeles Kings     | 1.0746269 | 1.118 | 1.104 |
| Philadelphia Flyers   | 1.0447761 | 1.122 | 1.179 |
| Florida Panthers      | 1.0298507 | 1.118 | 1.175 |
| Carolina Hurricanes   | 1.0000000 | 1.138 | 1.136 |
| Buffalo Sabres        | 0.9855072 | 1.127 | 1.158 |
| Winnipeg Jets         | 0.9565217 | 1.110 | 1.143 |
| Vancouver Canucks     | 0.9558824 | 1.115 | 1.152 |
| Dallas Stars          | 0.9552239 | 1.119 | 1.100 |
| Detroit Red Wings     | 0.9545455 | 1.151 | 1.059 |
| New Jersey Devils     | 0.9117647 | 1.148 | 1.132 |
| Arizona Coyotes       | 0.8358209 | 1.133 | 1.098 |
| Colorado Avalanche    | 0.6119403 | 1.128 | 1.164 |
+-----------------------+-----------+-------+-------+

In the next installment we're going to talk about the application of the Sonneborn-Berger coefficient to the NHL regular season.