**Learn how analyzing stats from professional sports leagues is an instructive use case for data analytics using Apache Spark with SQL. Covered in this installment: data exploration with Apache Impala (incubating) and Hue.**

In Part 1 of this series, I introduced the topic of using fantasy sports analytics as an instructive use case for exploring the Apache Hadoop ecosystem. In that installment, we focused on data processing by taking a collection of data from Basketball-Reference.com and enriching it with z-scores and normalized z-scores to analyze the relative value of NBA players. This time, we’ll continue exploring the data interactively with Apache Impala (Incubating) in the Hue UI. This example will help illustrate that with CDH providing a storage layer and tools for data processing and data analytics like Spark and Impala, you can easily transform and explore data in a variety of ways.

All the code for this post is available on GitHub; refer to Part 1 for an overview of the data processing that got us to this point.

## Interactive Data Analysis: Finding Trends in Age and Experience

Last time, you learned how to work with a DataFrame and query it using Apache Spark SQL. This time, you’ll learn how to ask some more difficult questions and quickly get answers with Spark and Impala. Because we were thorough in our data processing, you’ll be able to get results from the data with only a few simple queries. This approach is typical of the Apache Hadoop best practice of using data processing frameworks to reduce the complexity of interactive queries.

A burning question that is always on the mind of fantasy sports owners is: “How is age going to affect a player’s season?” Athletes are mere humans, and in time, their skills decline. When does an all-star regress to being an average player? Can we calculate the expected gain/loss of the value of a player as they age? We’ll focus on that topic here.

We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because the ages at which players join the league have shifted over the timespan we are considering. It used to be rare for players to skip college; then it became common, and now they are required to play at least one year. It will be interesting to see whether the numbers show a difference between age and experience.

We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)

```scala
import scala.collection.mutable.ListBuffer

//group data by player name
val pStats = dfPlayers.sort(dfPlayers("name"), dfPlayers("exp") asc).map(x =>
  (x.getString(1), (x.getDouble(50), x.getDouble(40), x.getInt(2), x.getInt(3),
    Array(x.getDouble(31), x.getDouble(32), x.getDouble(33), x.getDouble(34), x.getDouble(35),
      x.getDouble(36), x.getDouble(37), x.getDouble(38), x.getDouble(39)),
    x.getInt(0)))).groupByKey()
pStats.cache

//for each player, go through all the years and calculate the change in valueZ and valueN, save into two lists
//one for age, one for experience
//exclude players who played in 1980 from experience, as we only have partial data for them
val excludeNames = dfPlayers.filter(dfPlayers("year") === 1980).select(dfPlayers("name")).map(x => x.mkString).toArray.mkString(",")

val pStats1 = pStats.map { case (name, stats) =>
  var last = 0
  var deltaZ = 0.0
  var deltaN = 0.0
  var valueZ = 0.0
  var valueN = 0.0
  var exp = 0
  val aList = ListBuffer[(Int, Array[Double])]()
  val eList = ListBuffer[(Int, Array[Double])]()
  stats.foreach(z => {
    if (last > 0) {
      //not the first season: compare with the previous season's values
      deltaN = z._1 - valueN
      deltaZ = z._2 - valueZ
    } else {
      //first season: there is no previous year to compare against
      deltaN = Double.NaN
      deltaZ = Double.NaN
    }
    valueN = z._1
    valueZ = z._2
    last = z._3
    aList += ((last, Array(valueZ, valueN, deltaZ, deltaN)))
    //check the player's name (the map key) against the exclusion list
    if (!excludeNames.contains(name)) {
      exp = z._6
      eList += ((exp, Array(valueZ, valueN, deltaZ, deltaN)))
    }
  })
  (aList, eList)
}
```
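To make the year-over-year calculation easier to follow outside of Spark, here is a minimal plain-Scala sketch of the same idea for a single player. The `Season` case class and `withDeltas` function are hypothetical names invented for this illustration, and the sketch tracks only zTot by age rather than the full set of values used above.

```scala
// Hypothetical record for one player-season; not part of the post's code.
final case class Season(age: Int, zTot: Double)

object DeltaSketch {
  // Pair each season with its change from the previous season.
  // The first season has no predecessor, so its delta is NaN.
  def withDeltas(seasons: Seq[Season]): Seq[(Int, Double, Double)] = {
    val sorted = seasons.sortBy(_.age)
    // Shift the values by one position so each season lines up with the prior one.
    val previous = Double.NaN +: sorted.map(_.zTot).dropRight(1)
    sorted.zip(previous).map { case (s, prev) => (s.age, s.zTot, s.zTot - prev) }
  }
}
```

Calling `DeltaSketch.withDeltas` on a player's seasons returns one `(age, zTot, delta)` triple per season, with `NaN` marking the rookie year so it can be left out of later averages.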

We’ll now process the list of age-value pairs. A new function, `processStatsAgeOrExperience`, is defined; it goes through our normal process of mapping raw statistical data to `bballStats` objects and reducing it by each statistic. This gives our aggregate stats for each statistic. We’ll again need to take our RDD and convert it into a DataFrame. In this example, we’ll load it directly into Apache Hive for querying by Impala in Hue. Hue lets you take advantage of some visualizations that will make analyzing the data a little easier.

```scala
//extract out the age list
val pStats2 = pStats1.flatMap { case (x, y) => x }

//create age data frame
val dfAge = processStatsAgeOrExperience(pStats2, "age")

//save as table
dfAge.saveAsTable("Age")

//extract out the experience list
val pStats3 = pStats1.flatMap { case (x, y) => y }

//create experience DataFrame
val dfExperience = processStatsAgeOrExperience(pStats3, "Experience")

//save as table
dfExperience.saveAsTable("Experience")
```
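The body of `processStatsAgeOrExperience` is not shown here, and the real function works on `bballStats` objects in Spark. As a rough illustration only, the core aggregation (a mean per age or experience level) might look like the plain-Scala sketch below. The name `meanByKey` is hypothetical, and skipping `NaN` values is an assumption made so that rookie seasons do not distort the delta averages.

```scala
object AggregateSketch {
  // Mean of the non-NaN values recorded for each key (an age or an experience level).
  // Assumption: NaN marks a missing delta (a player's first season) and is skipped.
  def meanByKey(pairs: Seq[(Int, Double)]): Map[Int, Double] =
    pairs
      .filterNot { case (_, v) => v.isNaN }
      .groupBy { case (k, _) => k }
      .map { case (k, kvs) => k -> kvs.map(_._2).sum / kvs.size }
}
```

A key with no usable values simply does not appear in the result, which is one reasonable way to treat levels that have no delta data.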

## Visualizing the Results via Hue

Now let’s hop over to Hue to see our results. We’ll query the Age table and utilize the charting feature of Hue to visualize the results, plotting age on the x-axis and mean z-score on the y-axis:

`Select * from Age Order By age asc`

Average zTot for each age group

The findings are pretty clear: on average, young players struggle to contribute value until their mid-twenties, peak at 28, create pretty good value in their early 30s, and then tail off quickly starting at 37. Recall from our earlier look at the spread of recorded seasons across ages that most seasons were concentrated between ages 22 and 32, so the tails on either end rest on very small amounts of data, which explains the large swings. (The youngest and oldest players also get fewer minutes on average, which affects their values in counting stats [PTS, 3P, TRB, AST, STL, BLK] and results in lower z-scores and n-scores. Our analysis ignores minutes played in calculating values, but a similar exercise can be done on stats averaged out per 36 minutes of play. Players who don’t play much generally don’t significantly contribute to a fantasy team, so ignoring playing time suits our purposes here.)

Looking at nTot values for each age, we see a similar pattern of people peaking in their late 20s and declining through the 30s. Note that the mean nTot is much lower than the mean zTot, which one can interpret as there being more below-average players and only a few “stars” who excel in a majority of categories—making great players a rarity. (This is likely one of the reasons it’s challenging to find deep leagues that dig into more than the top 120-150 players in the league. In 1980, we have 55 players posting net positive normalized z-scores and 115 in 2015.)

Average nTot for each age group

By inspecting the change in z-score and normalized z-scores, we notice that a player on average continues to improve his game until age 26, then begins to decline:

Change in zTot for each age group compared to the previous year

Change in nTot for each age group compared to the previous year

If you’re wondering why the average player peaks at 28 but only continues getting better until age 26, the reason is that we’re not looking at the same set of players from one age to the next. Recall that the peak year for player participation is age 24, in which 1,626 seasons have been logged. That means that in the beginning we’re calculating average z-scores that include rookies, but those are not included in those years’ delta scores, as they have not yet logged a full season. Similarly, as players begin to drop out of the league, they no longer factor into the delta calculations. At age 25, we have 1,455 seasons logged, down 171 from age 24. These are largely players that were cut from the league (i.e. among the worst players), so removing them results in a net positive across all players of that age group, even if the average player is beginning to decline slightly.
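A toy example can make this effect concrete. The sketch below uses four invented players with made-up z-scores (none of these values come from the dataset): the two weakest leave the league after age 24, and the two survivors each decline slightly, yet the group average still rises.

```scala
object SurvivorshipSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Invented z-scores at age 24, keyed by a placeholder player name.
  val age24 = Map("A" -> 2.0, "B" -> 1.0, "C" -> -1.0, "D" -> -2.0)
  // At 25, C and D are out of the league; A and B each decline by 0.25.
  val age25 = Map("A" -> 1.75, "B" -> 0.75)

  val meanAt24 = mean(age24.values.toSeq) // 0.0
  val meanAt25 = mean(age25.values.toSeq) // 1.25
  // Deltas exist only for players present in both years.
  val meanDelta = mean(age25.map { case (n, v) => v - age24(n) }.toSeq) // -0.25
}
```

The average level goes up from 0.0 to 1.25 even though the average change is -0.25, because the two averages are taken over different sets of players.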

Next we’ll look at the same metrics, now organized by experience:

`Select * from Experience Order By experience asc`

Average zTot of players by experience level

Recall that by looking at experience, we are removing the ambiguity about what age players enter the league. This approach offers a better view of the longevity of a player and the wear and tear they sustain over time. Additionally, we record 2,738 total seasons of 0 experience, and each year declines after that. (Indeed, 626 players never make it to their second year in the NBA!) Most players prove to peak around year 7, but return positive value while steadily declining until around year 13. Note that by year 14 we are down to 196 seasons logged, so we’re really dealing with an elite group that can sustain productivity that long. Looking at nTot tells a similar story, so we’ll omit it. Looking at the delta z-scores shows us that players stop improving, on average, after their 4th year.

Change in zTot among player experience compared to the previous year

Note that in year 4 we are down to 1,225 players, fewer than half of the number we started with.

Oddly, looking at age alone seems to indicate that players have a longer period of growth than looking at experience does. This anomaly arises because players enter the league at different ages, and there is usually a sharp increase in the first couple of years followed by a gradual decline. If we look at the number of rookies per age, we see that there have only been 100 who were 18 or 19, 400 who were 20 or 21, and over 1,400 who were 22 or 23. After that, it drops off quickly. Considering that most players enter the league at 22 or 23 and on average will continue to improve for 4 years, the average player would appear to improve into age 26 or 27, which is what we saw when we looked at the delta stats over age. This fact highlights the importance of knowing the *context around the data* to avoid making erroneous conclusions. With that in mind, we’ll focus on *experience* over *age* going forward.
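The arithmetic behind that reconciliation can be sketched in a few lines of plain Scala. The rookie counts are the approximate figures quoted above, keyed by the younger age of each two-year bucket. Rookies aged 24 and older are left out because no counts are given for them, so the shares are only indicative. The object and value names are hypothetical.

```scala
object EntryAgeSketch {
  // Approximate rookie counts from the text: 18-19, 20-21, and 22-23 year olds.
  val rookiesByEntryAge = Map(18 -> 100, 20 -> 400, 22 -> 1400)
  // Average number of years a player keeps improving, per the experience analysis.
  val yearsOfImprovement = 4

  // Share of these rookies whose improvement would end in each age bucket.
  def shareByLastImprovingAge: Map[Int, Double] = {
    val total = rookiesByEntryAge.values.sum.toDouble
    rookiesByEntryAge.map { case (entry, n) => (entry + yearsOfImprovement) -> n / total }
  }
}
```

Under these assumptions, roughly three quarters of the counted rookies land in the 26-27 bucket, which is consistent with the age-based delta charts.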

Also, don’t forget that we’re speaking of the average player here—some break the rules. Let’s look at a few examples, picking nTot as our metric. (zTot leads to similar conclusions.)

nTot scores for Michael Jordan per experience level

nTot scores for Shaquille O’Neal per experience level

nTot scores for Allen Iverson per experience level

Here we see trends that agree with our analysis above. By Michael Jordan’s 4th year, he had reached fantasy-elite status and his subsequent years were around the same mark. Similarly, Allen Iverson and Shaquille O’Neal were both playing at or near their peak level within 4 years. Being elite players, they maintained high value late into their careers—which is not what we would expect of the “average” player—but even in the examples above, there were a couple years of growth before stardom was justified.

A few players arguably break from the average trajectory:

- Kyle Korver is an interesting example of a player who had trouble in his early years but emerged as a specialist and has thrived in his 30s.
nTot scores for Kyle Korver per experience level

- Tyreke Evans is a rare player who peaked in his rookie year and has not improved since due to injuries.
nTot scores for Tyreke Evans per experience level

- Kawhi Leonard has not regressed in his short career. He spent many of his first years in reserve, so his value has increased as he has seen more playing time and taken on more offensive responsibility. He’s already ranked in the top 10 this year, so from a fantasy standpoint his development is likely reaching its conclusion this season.
nTot scores for Kawhi Leonard per experience level

- Stephen Curry, however, was troubled with injuries early in his career and has arguably seen his growth delayed. He’s 27 now and possibly just had the best season he will ever have.
nTot scores for Stephen Curry per experience level

All of this goes to show that it pays to know the context of the players in providing insight past the numbers.

Which brings us to our next point: Who’s the best of all time? The answer differs depending on whether we look at z-scores or normalized z-scores. Looking at raw z-scores, we see that Curry just had the best year ever. Larry Bird’s 1987 campaign comes in a close second. We see other great players in the mix as well, like Michael Jordan and Kevin Durant.

`select name, year, age, zTot from players order by zTot desc`

When we switch to normalized z-scores, it becomes the Michael Jordan show: he records 5 of the top 10 seasons.

Curry’s 2015-16 season comes in at number 6. That’s certainly impressive, but he still lags Michael Jordan’s peak years. Curry could improve in the next few years, but given his tenure in the league, it’s more likely he will start on a slight downward trend.

## Conclusion

In this installment, we used Spark and Impala to determine at what year in a player’s career he is expected to peak. We also calculated how much a player is expected to change in value from one year to the next. These are valuable data points that can be used to help determine which players are expected to increase or decrease in value between seasons.

*Jordan Volz is a Systems Engineer at Cloudera.*