
Whitepaper: How Tickenest Does All Of That Technical Stuff (Stats, Graphs, Discord Bot)



This document describes in reasonably plain English the methods that I use to do all the things that I do with graphs and stats and Discord bots. My purpose in writing it is to share knowledge with those who may have an interest in how to do such things, whether in pursuit of NHL ’94 analytic excellence or just for general skillbuilding.

This document is divided into several sections. Each section covers a distinct chunk of the processes that I have developed, although some processes feed into others or at least dovetail with others. I promise not to get too technical, meaning that I’m not going to deep dive in lots of code or anything. I’m going to try and stick to clear explanations of things. Technical information will show up from time to time, but I’ll always try to give intuitive explanations to go with the technical explanations.


Section 1 – Scraping Boxscores From nhl94online.com

Part A – nhl94online.com Background

In order to process data (NHL ’94 match data, in this case), first, we must have the data. The data begin life as entries in a database on nhl94online.com. Every time a player uploads a save state, the website’s database generates an entry in a table (possibly across multiple tables). That entry contains all of the information that the game produces about the game that was played (or, at least, all of the information that we know about).

Whenever a user clicks on a box score link on the website, the database provides the data for that box score to a PHP script. This script then generates, live, an HTML page that contains all of the box score information, formatted nicely in the format that we’re all used to. The alternative would be to store every box score as its own HTML file. That would be slightly faster, as the webserver could just pass the existing HTML file to the user’s browser, but it would use much more storage, as the server would have to create and store a new HTML file for every box score uploaded. The “generate the HTML live from the box score data by means of a script” method trades a little bit of time lost for a lot of storage space on the webserver gained.

Therefore, one way to acquire the box score data on the website would simply be to acquire direct access to the database. I asked chaos, our webmaster, about this possibility, and he did eventually provide me with some dumps of database tables as a test to see if I could process the data, but by the time he was able to do so, I had already worked out an alternate method of acquiring the data that did not require any action on his or anyone else’s part.

What I did was to create a Python script that can automatically (with some human guidance) download box score HTML pages from the website, extract the box score statistics from the HTML, and organize the box score data. Python is a general-purpose programming language that is quite popular for data processing and analysis, and I was already quite familiar with its capabilities when I began this project.

Part B – Processing One Box Score

In order to process lots of box scores, we have to download them and then we have to process them. Let’s start by talking about how to download and process one box score (we’ll discuss downloading and processing lots of box scores next). If you’ve never looked at HTML code before, it’s essentially specially structured text that allows a web browser to create all of those pretty web pages we’re used to seeing. In the case of nhl94online.com, there’s a lot of HTML code that surrounds the actual box score names and numbers that we’re interested in. I estimate that the ratio of HTML code to actual box score information is about 100 to 1, meaning that for every text character (letter or number) of box score information in one box score HTML page, there are about 100 characters of HTML code. For example, here’s a small segment of code containing the information about one skater’s stats from one game:

<tr class="evenrow" ><td colspan="2">42-Sergei Makarov</td><td>F</td><td>3</td><td>3</td><td>6</td><td>3</td><td>7</td><td>1</td><td>14:01</td></tr>

 

I count 148 characters in that snippet, 29 of which are actual data about the game, a ratio of about 5 to 1, which is much better than 100 to 1, but I deliberately chose a particularly data-dense snippet of HTML code.

So how do we extract the relevant box score information from that? Fortunately, in Python, there are multiple methods of doing so. I chose a fairly simple, somewhat inefficient, and kinda crude method of doing so, but it works very well so I’m sticking to it. My method was to examine the HTML very carefully in order to determine where exactly all of the different pieces of information start and stop within the HTML. Remember before when I said that a PHP script generates the HTML code live whenever someone clicks on a link? Well, fortunately, that means that all box scores on the site have the same underlying structure in the HTML.

So what? Well, that means that if I can write code that will process and extract the box score data from one box score on the site, then I’ll have a script that can process and extract the box score data from every box score on the site! And so I did that. I painstakingly examined the HTML, figuring out how to go through it in Python code, extract every single piece of relevant box score data, and organize it into a proper structure. Once this parsing (as the process is known) of the HTML is complete, my Python code then packages the box score information into a tidy package of data, ready for shipping to wherever I need it.
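To make the idea concrete, here is a minimal sketch of pulling the values out of the skater row shown above. This is not my actual parsing code, just an illustration of the principle. It uses a regular expression from Python’s standard library, and the names `ROW`, `CELL_PATTERN`, and `parse_skater_row` are made up for this example.

```python
import re

# The skater row shown earlier in this section, copied verbatim.
ROW = ('<tr class="evenrow" ><td colspan="2">42-Sergei Makarov</td><td>F</td>'
       '<td>3</td><td>3</td><td>6</td><td>3</td><td>7</td><td>1</td>'
       '<td>14:01</td></tr>')

# Captures the text between each opening <td ...> tag and its closing </td>.
CELL_PATTERN = re.compile(r'<td[^>]*>(.*?)</td>')


def parse_skater_row(row_html):
    """Return the values in one skater row as a list of strings."""
    cells = CELL_PATTERN.findall(row_html)
    # The first cell combines jersey number and name, e.g. "42-Sergei Makarov".
    number, name = cells[0].split('-', 1)
    return [number, name] + cells[1:]


print(parse_skater_row(ROW))
```

Because every box score page has the same structure, a function like this that works on one skater row should work on every skater row, which is the whole point of the approach.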

Okay, I lied a little bit in the above paragraph. I came across a few complications during the HTML parsing process that threw me for a loop until I discovered the problem. First, there are really three different box score formats on the website. I call them Classic SNES, Classic Genesis, and Blitz. On the Genesis side, years ago, various people on the forums figured out how to modify the NHL ’94 Genesis ROM in order to extract a few extra pieces of information: Checks Against and +/-. The first league to use this modified ROM was called a “Blitz” league, and so I’ve chosen that name to describe leagues that use the modified ROM as a base: Blitz leagues. The save state processing code on the website can detect whether it’s processing a Blitz save state or a Classic Genesis save state and do slightly different things in order to produce slightly different outputs (specifically, writing Checks Against and +/- for every player who played in the game to the box score). Well, the two different Genesis formats do cause the website to produce slightly different HTML code in the box scores, and so I had to adjust my own Python code to detect whether it’s processing a Classic Genesis box score or a Blitz box score. In the end, my Python script doesn’t bother outputting the extra stats (Checks Against and +/-) because I don’t think they’re very important.

In a similar vein, all SNES leagues output one kind of box score, but SNES does not record Checks For or Time on Ice in the ROM, and so those stats do not appear for players in the box score. My Python code is able to detect this situation, as well. However, I do save player Checks For and TOI in Gens, and so my code does output slightly different information for SNES box scores than for Gens box scores. The lack of TOI information makes no difference, as I don’t do anything with TOI in further calculations (except on special occasions when I’ve attempted to assess, say, player penalty rates), but I do have to skip a couple of graphs in my graph outputs because of the lack of Checks For.

I note that on rare occasions, a section of data is missing from a box score. Specifically, one team’s skater data may be missing if, somehow, no skater registered any statistics at all. This is only possible in SNES, because in Genesis all skaters get their TOI recorded, and so there will always be skaters in Genesis box scores. It is also possible for one team’s goalie data to be missing if one team didn’t manage a single shot and that goalie didn’t manage any other statistics in the game. When this does happen, it is most often because the box score is from a simmed game. A simmed game occurs when the two players cannot save a save state from the end of their game for any reason, typically because of a late disconnect, or because someone pressed the R key to reset the emulator instead of the F5 key to take a save state, or shucks, sometimes both players just forget. In such cases, one player will start a new game, play against the computer up to the correct final score, and then take and submit a save state from that game. Typically, but not always, these games are front loaded with goals, because the human just wants to be done with the simming as quickly as possible. I do not exclude suspected sim games from my database, as they are typically preferable to not having the box score at all. In any case, my Python code can handle games with missing sections of data.

Lastly, there are a few instances on the website where a box score does not load properly and no statistics are available. In such cases, I simply do not have an entry for that game in my database, as there’s no data to be stored. The website typically does have the final score, but nothing else. I do not know why this phenomenon occurs.

Part C – Storing One Box Score

So once we’ve successfully processed one box score, what do we do with this information? Where do we put it? Well, my preference is to store the data in what’s called a CSV, or Comma Separated Values, file. This is essentially a text file that stores tabular (spreadsheet) data. One row of the CSV file is all of the data for one box score. Each individual piece of data (each individual column) is separated from the surrounding data in the text file by commas. CSV is a common format for exchanging data between various software programs because of its simple format. Practically any software package, including Excel, can read CSV data. And so when I process box score data, whether one match or many matches, I write the processed data to a CSV file, where I can do further analysis on it.
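As an illustration, here is a small sketch of writing box score rows to CSV text with Python’s built-in `csv` module. The column names and the two games are made up and heavily shortened (my real files have far more columns), and `games_to_csv` is a hypothetical helper, not my actual code.

```python
import csv
import io

# Made-up, heavily shortened columns and games, for illustration only.
FIELDNAMES = ['MatchID', 'HomeTeam', 'AwayTeam', 'HomeScore', 'AwayScore']
GAMES = [
    {'MatchID': 1, 'HomeTeam': 'AAA', 'AwayTeam': 'BBB', 'HomeScore': 4, 'AwayScore': 2},
    {'MatchID': 2, 'HomeTeam': 'BBB', 'AwayTeam': 'AAA', 'HomeScore': 1, 'AwayScore': 1},
]


def games_to_csv(games, fieldnames):
    """Return CSV text with a header row and one row per box score."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(games)
    return buffer.getvalue()


print(games_to_csv(GAMES, FIELDNAMES))
```

To write to an actual file instead of an in-memory buffer, the usual pattern is to pass `csv.DictWriter` a file opened with `open(path, 'w', newline='')`. The `csv` module also takes care of quoting any value that happens to contain a comma.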

Part D – From One Box Score To Many Box Scores

We’re doing great! We can process the HTML that makes up a box score and extract all of the relevant stats! But... there are tens of thousands of box scores on nhl94online.com. A given league typically has hundreds of box scores available. Do we have to click on every link and save the HTML data?

Thankfully, no. We can script this process as well with Python. The key insight is that the format of box score URLs is consistent. Let’s look at two examples:

https://www.nhl94online.com/html/box_score.php?gameid=83512

https://www.nhl94online.com/html/pl_box_score.php?gameid=20041

The first URL is for a regular season box score. The second is for a playoff box score. The only difference in the format is that the playoff box score has an extra “pl_” in the URL. Also, each game is identified uniquely by the number after the “gameid=” part at the end. Every box score on the website is assigned a unique ID number. Or, as I recently learned, regular season games and playoff games are numbered separately. In other words, only one regular season game on the website will ever be assigned a given ID number, and only one playoff game on the website will ever be assigned a given ID number, but it is possible for a regular season game and a playoff game to be assigned the same ID number. This is not a problem, because regular season URLs and playoff URLs use those two differing formats.
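In code, building a box score URL from an ID number is a one-liner. Here is a minimal sketch. The function name `box_score_url` and the `playoff` flag are my own inventions for this example, while the URL formats are the two shown above.

```python
BASE_URL = 'https://www.nhl94online.com/html/'


def box_score_url(game_id, playoff=False):
    """Build the box score URL for a regular season or playoff game ID."""
    # Playoff box scores use the same script name with a "pl_" prefix.
    script = 'pl_box_score.php' if playoff else 'box_score.php'
    return f'{BASE_URL}{script}?gameid={game_id}'


print(box_score_url(83512))
print(box_score_url(20041, playoff=True))
```

Since a regular season game and a playoff game can share an ID number, anything that stores or looks up games should keep track of both pieces, for example as a `(playoff, game_id)` pair.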

This means that if we can find or generate a list of IDs for the box scores that we want to download, then we can write a Python script that downloads the HTML files. The bad news is that nhl94online.com does not offer, say, a single webpage that lists the entire schedule of games that will be played in a single league. The good news is that we can work around this. I’ll note at this time that when I say “a single league” below, what I really mean is “one level of one league” (A, B, C, etc.) A league that has one level is one “league” for box score downloading purposes, while a league that has 8 or 9 divisions (as is the case in combined Classic seasons, for example) is 8 or 9 “leagues” for box score downloading purposes.

Here is the method of scraping (downloading) all of the HTML box scores for a single league:

1.       Download the standings page of the league. The standings page has links to every coach’s individual page, and we can easily grab those URLs from the standings page because coach’s page URLs are standardized, just as the box score URLs are standardized.

2.       For each coach in the league, download the coach’s page. Each coach’s page has a collection of URLs like the regular season URL shown above, and we can easily pick out the ID numbers of box scores that have been uploaded to the website so far (i.e. we can easily pick out which games on the schedule the coach has already played.) Once we do this for every coach, we have a full list of ID numbers for games that have been played in the league.

3.       Run a Python script to download, one at a time, each individual box score for that league in HTML format and process the HTML. If a league has 264 games and they have all been played, we can download 264 HTML box scores and process them in order to produce a CSV file of 264 rows, one per game. In practice, we throttle (slow down) the downloading script because we don’t want to overwhelm the nhl94online.com server with too many requests (which might also cause us to get blocked by the website). I typically institute a 3-second delay between requests for box scores, so the delays alone add up to about 13 minutes for 264 box scores, and the whole download takes about 15 minutes, on average.

Once this process is complete, then we have full match data for a season! We can then do advanced statistics on these data and create beautiful graphs. Alternately, we can then answer Discord bot queries about the data.

One additional detail is that we can have the code detect which box scores, if any, have previously been downloaded by the script. After all, there is no reason to slow down the processing and strain the website unnecessarily if we’ve already downloaded some of the box scores previously.
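Here is a rough sketch of steps 2 and 3 above, including the throttling and the “skip what we already have” check. It is an illustration, not my actual script. It assumes that coach pages link to box scores with the `box_score.php?gameid=` format shown earlier, and that each downloaded page is saved as a hypothetical `<gameid>.html` file. All function names are invented for this example.

```python
import re
import time
from pathlib import Path
from urllib.request import urlopen

# Regular season links only: the lookbehind skips "pl_box_score.php" playoff links.
GAME_ID_PATTERN = re.compile(r'(?<!pl_)box_score\.php\?gameid=(\d+)')


def extract_game_ids(coach_page_html):
    """Return the sorted, de-duplicated game IDs linked from one coach page."""
    return sorted({int(gid) for gid in GAME_ID_PATTERN.findall(coach_page_html)})


def fetch_box_score(game_id):
    """Download one regular season box score page and return its HTML."""
    url = f'https://www.nhl94online.com/html/box_score.php?gameid={game_id}'
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def download_missing(game_ids, out_dir, fetch=fetch_box_score, delay_seconds=3.0):
    """Download only the box scores not already saved; return the IDs fetched."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for game_id in game_ids:
        target = out_dir / f'{game_id}.html'
        if target.exists():
            continue  # downloaded on an earlier run, so no request is needed
        if fetched:
            time.sleep(delay_seconds)  # throttle between actual requests
        target.write_text(fetch(game_id), encoding='utf-8')
        fetched.append(game_id)
    return fetched
```

The `fetch` parameter is there so that the download step can be swapped out, for example with a fake function while testing, so that trying out the script doesn’t hit the website at all.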

 

Part E – Summary

That was a lot. Let’s sum up what I discussed in this section.

·         Match data on nhl94online.com is stored in a database, and when you click on a box score link, a script generates the nice box score HTML file that you view in your browser.

·         We use a Python script to process the HTML in order to extract the box score data. There are three types of box scores on the website: Classic SNES, Classic GENS, and Blitz. They are all slightly different from each other, but my Python code can handle all three without difficulty.

·         On very rare occasions, an entire section of skater/goalie data may be missing because no such statistics were recorded during the game, but our script is able to handle those rare situations, as well. Also, a few games in the database do not produce any data in their box scores. These games are skipped in our database.

·         We store processed box score information in a CSV file, which is essentially a spreadsheet in text form. CSV files are easily read by any statistical/analytical/spreadsheet software.

·         We can download every box score for a given league (one level of one league) by downloading the standings page, getting the URLs of the coach pages and downloading those, extracting the ID numbers of the box scores for games that have been played to date, and then downloading and processing those box scores one at a time, producing one CSV file with one row for every box score that we processed.

And here is a full list of every field (column) of data that appears in my processed box scores at this point:

Match ID, Reg. Season or Playoffs?, League Name, Date/Time, Home Team Name, Home Team Abbreviation, Home Player, Away Team Name, Away Team Abbreviation, Away Player, Away Score, Away Shots, Away Shooting %, Away PP Goals, Away PP Tries, Away SH Goals, Away Breakaway Goals, Away Breakaway Tries, Away One-Timer Goals, Away One-Timer Tries, Away Penalty Shot Goals, Away Penalty Shot Tries, Away Faceoff Wins, Away Checks For, Away PIM, Away Attack Zone Time, Away Pass Comps, Away Pass Tries, Away Goals and Shots (all 3 periods + overtime), Away Skater Stats, Away Goalie Stats, Away Goals (the list of away goals with details), Away Penalties (list of penalties), All of THOSE statistics again (but for the Home Team), Total Number of Faceoffs, Highest db Level


Section 2 – From Box Scores To Advanced Stats

Part A – Expanded Box Scores With Python

Now that we have a nice, structured, tabular dataset of our box scores, we’re ready to do some fancy statistics. Unsurprisingly, Python is well suited for this kind of work. Specifically, the “pandas” module (extension) for Python is commonly used for processing tabular data of the type that we have here. The name “pandas” is an abbreviation of “panel data”, which is a term for datasets that track multiple subjects over time. The guy who created the pandas module used it for panel data, and that is how it got its name, but pandas is useful for all kinds of analysis. With pandas, we can load the data into a dataframe (essentially a table). From there, we can perform fancy mathematical operations in code.

There are multiple methods of doing the kinds of analysis that we want to with the data, but I will describe my chosen method here. Most of the time, I want to create a “season summary” of a bunch of statistics for every player who participated in one season. However, it is easy to extend “season summary” to related types of answers such as “one player’s lifetime record” or “two players’ head-to-head stats” or even “all-time statistics for every player who’s ever played at least one game all at once.” Whichever of those options that I choose, the basic method of deriving most of the summary statistics that I’ve developed is:

1.       For every individual game in the dataset, calculate the relevant statistic (essentially, extend each game’s box score.)

2.       For every individual player (human) in the dataset, calculate the aggregated statistic for that human.

For example, the very first thing that my code calculates about each game in the dataset is “which team won?” Obviously, this is a rather trivial calculation, as the winning team is the team that scored more goals. The reason for doing so is that it is a bit more convenient to have a dedicated “which team won?” column for use with later calculations where I need to know which team won. But…what exactly do I want to record here? Do I want to record the name of the winning team? The name of the winning human? Well, those could certainly be done, but it turns out that the best way (I think) to record “which team won?” is to record the winner based upon Home/Away, meaning that we’ll record if the Home team won, the Away team won, or it was a tie. We’ll create a new column at the far right of the table, and ask pandas, for each row (remember, one row = one box score), to place a “1” value in the new “WhichTeamWon” column (I don’t put spaces in my column names because it is easiest to work with column names when they don’t have spaces) if the Home team won, -1 if the Away team won, and 0 if the game was a tie (I’ll note that “positive for Home, negative for Away” will be a running theme throughout the box score processing.) The nice thing about this calculation is that I don’t have to write a separate line of code for each individual box score in the dataset and I don’t have to use some sort of loop to process each row one at a time. Instead, I can just do it like this (“myDF” is how I refer to the dataframe):

myDF['WhichTeamWon'] = 1

myDF.loc[(myDF['HomeScore'] == myDF['AwayScore']), 'WhichTeamWon'] = 0

myDF.loc[(myDF['HomeScore'] < myDF['AwayScore']), 'WhichTeamWon'] = -1

I plan to keep the code to a minimum in this document, but I’ll describe these three lines here. The first line says “create a new column called ‘WhichTeamWon’ and, for every row in the dataframe, assign the value 1.” The second line says “for every row where the HomeScore and the AwayScore are the same, change the 1 to a 0.” The third line says “for every row where the AwayScore is greater than the HomeScore, change the 1 to a -1.” If I really wanted, I could condense this code even further, but I choose not to in order to keep it more readable.
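For the curious, here is one way the three lines above could be condensed into one, as a sketch. It relies on NumPy’s `sign` function, which returns 1, 0, or -1 depending on whether a number is positive, zero, or negative. The three games in this tiny dataframe are made up for the example.

```python
import numpy as np
import pandas as pd

# Three made-up games: a Home win, a tie, and an Away win.
myDF = pd.DataFrame({'HomeScore': [5, 2, 3], 'AwayScore': [3, 2, 6]})

# The sign of the goal difference: 1 = Home won, 0 = tie, -1 = Away won.
myDF['WhichTeamWon'] = np.sign(myDF['HomeScore'] - myDF['AwayScore'])

print(myDF)
```

It is shorter, but a reader has to know what `np.sign` does, which is the readability tradeoff that I mentioned.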

In a similar manner, we can keep adding new columns to the box scores dataframe with (mostly) simple calculations. As another example, the game tells us how many goals each team scored in each period, but it doesn’t tell us how many penalties each team committed in each period. But we can calculate that because we have the penalty summary from the box score, and so we can parse the penalty summary in order to calculate how many penalties the Home and Away teams had in each period. Once we’ve done so, we can then use those extensions of the box score dataframe in order to calculate summary statistics for each player for the whole season.

Part B – From Single-Game Box Scores to Full-Season Player Stats

Expanded box scores are occasionally useful on their own, but much more often, they are a tool for calculating advanced statistics about the humans’ seasons. So how do we do this? Well, pandas provides an easy method of performing what is known as a “groupby” operation, meaning that we can take a column, such as “number of one-timer goals the Home team scored in the game”, and, with just a bit of code, perform the calculation “for each human player, calculate how many one-timer goals they scored at Home.”

Now, you might be wondering why we would want to calculate every human’s one-timer goals at Home, as opposed to their one-timer goals Home and Away combined. There are two reasons. First, Home/Away stat splits can be interesting in and of themselves. Second, because of how the data are structured in the box scores (Home team data and Away team data), the easiest method of generating “total one-timer goals for every player for the whole season” is “calculate every player’s Home one-timer goals, calculate every player’s Away one-timer goals, and then add the two together.” In fact, most of the summary stats are calculated in this way (pretty much any stat that isn’t a multi-game stat, such as “longest shutout streak across multiple games.”)

Therefore, I wrote a function called “HATCalc”, for “Home/Away/Total Calculation.” With this function, we can quickly calculate most of the summary statistics just by telling the function (in code, of course) how to aggregate the relevant statistic and on which games to calculate the statistic. For example, to calculate “wins in one-goal games,” we tell the function to count up, for each player, how many times they were the Home player in a game where HomeScore – AwayScore = 1, how many times they were the Away player in a game where AwayScore – HomeScore = 1, and add the two together. To get the related statistic “losses in one-goal games,” for each player, we count up how many times they were the Away player in a game where HomeScore – AwayScore = 1, how many times they were the Home player in a game where AwayScore – HomeScore = 1, and add the two together. In fact, the vast majority of the statistics can be calculated this way, where we calculate the “positive” statistic (goals scored, shots on goal, one-timer attempts, etc.) and then quickly calculate the corresponding “negative” statistic (goals allowed, shots on goal allowed, one-timer attempts allowed, etc.) by using the same code, except carefully substituting “Home” and “Away” in the appropriate locations.
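To give a flavor of the Home/Away/Total idea, here is a simplified sketch that only handles counting games. It is not the real HATCalc function. The name `hat_count`, the column names `HomePlayer` and `AwayPlayer`, and the four games are all assumptions made for this example.

```python
import pandas as pd


def hat_count(df, home_mask, away_mask):
    """Per player, count games matching home_mask (as Home) and away_mask (as Away)."""
    home = df[home_mask].groupby('HomePlayer').size()
    away = df[away_mask].groupby('AwayPlayer').size()
    # Include every player, even those with no matching games at all.
    players = sorted(set(df['HomePlayer']) | set(df['AwayPlayer']))
    result = pd.DataFrame({'Home': home, 'Away': away})
    result = result.reindex(players).fillna(0).astype(int)
    result['Total'] = result['Home'] + result['Away']
    return result


# Four made-up games between three made-up players.
games = pd.DataFrame({
    'HomePlayer': ['ann', 'bob', 'cat', 'ann'],
    'AwayPlayer': ['bob', 'cat', 'ann', 'cat'],
    'HomeScore': [3, 2, 1, 5],
    'AwayScore': [2, 3, 4, 1],
})
diff = games['HomeScore'] - games['AwayScore']

# Wins in one-goal games: Home player when diff is 1, Away player when diff is -1.
one_goal_wins = hat_count(games, diff == 1, diff == -1)
# Losses in one-goal games: the same call with the two conditions swapped.
one_goal_losses = hat_count(games, diff == -1, diff == 1)

print(one_goal_wins)
print(one_goal_losses)
```

Note how the “losses” call is the “wins” call with the two conditions swapped. That is the Home/Away substitution trick described above.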

As I mentioned, a few of the season summary statistics are more complicated to calculate, but no need to go into those details here. Anyway, in the end, we wind up with an expanded dataframe of greatly enhanced box scores, and a separate dataframe of the season summary stats for all human players, which is exactly what we wanted.

Part C – Summary Stats and Graphs, Graphs, Graphs!

Now we’re ready to answer actual questions with the data. Technically, we could before, but at this point, we have detailed information about the humans’ overall performance, which is usually much more interesting to people than information about individual games. First, we can create a “season summary,” which is just an organized collection of leaguewide statistics, such as “what was the record of Home teams?” or “how many goals per game were scored?” or “what were the shooting percentages in the Up or Down direction?” Given that we’ve already done most of the math previously, assembling these stats only requires a little bit more code, including the code to print the results in a clean format.

Next, we can start making graphs. I begin a set of graphs with a set of heatmaps, which are those graphs with lots of numbered squares that show things like “how many times did a given scoreline happen in the season?” and “what was the distribution of Winner/Loser shots across every game?” I use the seaborn Python module for this.

Once those are complete, I begin making hundreds of graphs. For this, I use the matplotlib library. Creating precise graphs in code can be tricky, but the good news is that I only needed to perfect a few different formats, such as “one bar for each player,” “two bars for each player,” “three bars for each player,” and so forth. The code to create such graphs is fairly precise, but once they’re created, most of the time, I can easily substitute just a small portion of code that tells matplotlib which statistic to use for the next graph. In other words, once I’ve created my graph templates, I can use them quickly and easily to create lots of graphs.

I typically create separate “Home/Away” and “Total” graphs, or, for some statistics, separate “Home,” “Away,” and “Total” graphs, depending upon the number of bars required for the graph. In fact, I originally wrote the HATCalc function in order to consolidate my code, but quickly realized in the process that I had previously been discarding the separate Home/Away calculations, only being interested in the Total calculations. Once I wrote HATCalc, I began retaining those calculations and creating separate Home/Away graphs, thereby completely failing at my “consolidate the code” effort (the code base got larger after I effected this change), but gaining new insights into Home/Away splits.

Once the bar graphs are complete, the next step is to generate a few dozen scatterplots. A scatterplot is a visualization of two variables at once, one on the x-axis and one on the y-axis. Each dot represents the relevant pair of statistics for one human. For example, one scatterplot shows “one-timer success percentage vs. one-timer attempts per game”. Dots farther to the right belong to players with more one-timer attempts per game, and dots higher up represent players with a higher one-timer success rate. In this way, we can gain insights into player performance that we can’t with one-dimensional bar graphs.

The matplotlib module can produce the scatterplots in a manner similar to that of the bar graphs. One additional complication (but potential opportunity) is that we have to specify the minimum and maximum extents of the scatterplots on both axes. The complication is that we don’t know ahead of time what the proper extents should be because we don’t know what exact values will be used for each scatterplot. Fortunately, because the scatterplots typically use percentages or per-game values, the values will not be excessively high (say, in the hundreds), except for graphs involving attack zone time per game or per goal. In any case, we can calculate the minimum and maximum values of the variables being plotted and add a small amount to those values when specifying the minimum and maximum values that should be shown on the x-axis and y-axis.

A second additional complication is that we want to label each dot with the name of the player to whom the dot belongs. This particular detail is the most difficult challenge involved with creating these scatterplots, as we literally have to specify the x and y positions of each label. In the end, I wound up creating a pair of formulas to calculate the x and y position of each label based upon the previously calculated minimum and maximum extents of the scatterplot, though the formulas are not perfect, as the spacing between the dots and the labels is not consistent in all scatterplots. If I were being paid to do this, I would probably improve the method.
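Here is a sketch of the general idea behind both calculations: pad the axis limits a bit beyond the data, then nudge each label away from its dot by a small fraction of the axis extents. These are not my actual formulas. The function names and the default fractions are assumptions picked for illustration.

```python
def padded_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits slightly wider than the data range."""
    low, high = min(values), max(values)
    span = high - low
    # If every value is identical, fall back to a fixed pad so the axis has width.
    pad = span * pad_fraction if span > 0 else 1.0
    return low - pad, high + pad


def label_position(x, y, x_limits, y_limits, offset_fraction=0.01):
    """Return the (x, y) position for a label placed up and to the right of a dot."""
    dx = (x_limits[1] - x_limits[0]) * offset_fraction
    dy = (y_limits[1] - y_limits[0]) * offset_fraction
    return x + dx, y + dy


# Made-up per-game values for three players.
attempts = [2.0, 4.5, 7.0]
x_limits = padded_limits(attempts)
print(x_limits)
```

In matplotlib, the limits would then go to `ax.set_xlim` and `ax.set_ylim`, and each label to `ax.text` or `ax.annotate`. One possible improvement that I have not tried here is `ax.annotate` with `textcoords='offset points'`, which offsets the label by a fixed distance on the page rather than in data units, and so should keep the spacing consistent from one scatterplot to the next.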

Long after I developed the initial code, I had the idea of producing player summaries based upon the seasonal data. This proved to be fairly easy, given that most of the calculations were already complete. I merely had to assemble the correct calculations. In particular, the “opponent” calculations (goals scored by opponents, shooting percentage of opponents, etc.) don’t have to be re-calculated, because “opponents’ goals scored” is the same as “player’s goals allowed.” After the calculations and formatting are complete, we get dozens of player-specific statistics. A similar process creates several player-specific scatterplots.

Part D – Uploading to the nhl94.com Forums

Unfortunately, there is no easy method of uploading all of the graphs to the nhl94.com forums (my preferred posting location). I can easily choose to upload all of the graphs at once, but instead, I typically omit the Home/Away graphs for the simple fact that I already upload well over one hundred graphs, and I don’t want to strain the forum server even more. The additional reason is that “upload a graphic to a forum post” is a different process from “add a graphic to a forum post,” as I have to select each graph specifically once the graphs have been uploaded in order to add each graph to the forum post. It is a bit of a pain! But the results are worth it.
