This page shows how to combine NBA play by play data with SportVu data. The play by play dramatically increases the usefulness of the SportVu data by allowing the identification of plays that are misses and makes as well as the type of shot, e.g., layup or dunk. I have also posted my earlier markdown on exploring the SportVu data.
To read the SportVu data, first download the _functions.R file from my GitHub repository for this project.
library(RCurl)
## Loading required package: bitops
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
source("_functions.R")
The sportvu_convert_json function takes the SportVu JSON file and converts it into a data frame. For this game, the conversion takes about 3 minutes. The resulting data frame has about 2.6 million observations of 13 variables.
all.movements <- sportvu_convert_json("data/0021500431.json")
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
str(all.movements)
## 'data.frame': 2646562 obs. of 13 variables:
## $ player_id : chr "2225" "2225" "-1" "-1" ...
## $ lastname : chr "Parker" "Parker" "ball" "ball" ...
## $ firstname : chr "Tony" "Tony" NA NA ...
## $ jersey : chr "9" "9" NA NA ...
## $ position : chr "G" "G" NA NA ...
## $ team_id : num 1.61e+09 1.61e+09 NA NA 1.61e+09 ...
## $ x_loc : num 51.7 51.7 52.9 52.9 60.4 ...
## $ y_loc : num 40.3 40.3 39.9 39.9 31.8 ...
## $ radius : num 0 0 2.5 2.5 0 ...
## $ game_clock: num 716 716 716 716 716 ...
## $ shot_clock: num 13.3 13.3 13.3 13.3 13.3 ...
## $ quarter : num 1 1 1 1 1 1 1 1 1 1 ...
## $ event.id : num 2 1 1 2 2 1 2 1 2 1 ...
gameid <- "0021500431"
pbp <- get_pbp(gameid) # From the _functions.R file
head(pbp)
## GAME_ID EVENTNUM EVENTMSGTYPE EVENTMSGACTIONTYPE PERIOD WCTIMESTRING
## 1 0021500431 0 12 0 1 8:11 PM
## 2 0021500431 1 10 0 1 8:11 PM
## 3 0021500431 2 5 45 1 8:11 PM
## 4 0021500431 3 2 5 1 8:12 PM
## 5 0021500431 4 4 0 1 8:12 PM
## 6 0021500431 5 5 45 1 8:12 PM
## PCTIMESTRING HOMEDESCRIPTION NEUTRALDESCRIPTION
## 1 12:00 <NA> <NA>
## 2 12:00 Jump Ball Towns vs. Duncan: Tip to Green <NA>
## 3 11:43 <NA> <NA>
## 4 11:29 MISS Wiggins 2' Layup <NA>
## 5 11:28 <NA> <NA>
## 6 11:27 <NA> <NA>
## VISITORDESCRIPTION SCORE
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 Parker Out of Bounds - Bad Pass Turnover Turnover (P1.T1) <NA>
## 4 Leonard BLOCK (1 BLK) <NA>
## 5 Leonard REBOUND (Off:0 Def:1) <NA>
## 6 Leonard Out of Bounds - Bad Pass Turnover Turnover (P1.T2) <NA>
## SCOREMARGIN PERSON1TYPE PLAYER1_ID PLAYER1_NAME PLAYER1_TEAM_ID
## 1 <NA> 0 0 <NA> <NA>
## 2 <NA> 4 1626157 Karl-Anthony Towns 1610612750
## 3 <NA> 5 2225 Tony Parker 1610612759
## 4 <NA> 4 203952 Andrew Wiggins 1610612750
## 5 <NA> 5 202695 Kawhi Leonard 1610612759
## 6 <NA> 5 202695 Kawhi Leonard 1610612759
## PLAYER1_TEAM_CITY PLAYER1_TEAM_NICKNAME PLAYER1_TEAM_ABBREVIATION
## 1 <NA> <NA> <NA>
## 2 Minnesota Timberwolves MIN
## 3 San Antonio Spurs SAS
## 4 Minnesota Timberwolves MIN
## 5 San Antonio Spurs SAS
## 6 San Antonio Spurs SAS
## PERSON2TYPE PLAYER2_ID PLAYER2_NAME PLAYER2_TEAM_ID PLAYER2_TEAM_CITY
## 1 0 0 <NA> <NA> <NA>
## 2 5 1495 Tim Duncan 1610612759 San Antonio
## 3 0 0 <NA> <NA> <NA>
## 4 0 0 <NA> <NA> <NA>
## 5 0 0 <NA> <NA> <NA>
## 6 0 0 <NA> <NA> <NA>
## PLAYER2_TEAM_NICKNAME PLAYER2_TEAM_ABBREVIATION PERSON3TYPE PLAYER3_ID
## 1 <NA> <NA> 0 0
## 2 Spurs SAS 5 201980
## 3 <NA> <NA> 0 0
## 4 <NA> <NA> 5 202695
## 5 <NA> <NA> 0 0
## 6 <NA> <NA> 0 0
## PLAYER3_NAME PLAYER3_TEAM_ID PLAYER3_TEAM_CITY PLAYER3_TEAM_NICKNAME
## 1 <NA> <NA> <NA> <NA>
## 2 Danny Green 1610612759 San Antonio Spurs
## 3 <NA> <NA> <NA> <NA>
## 4 Kawhi Leonard 1610612759 San Antonio Spurs
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## PLAYER3_TEAM_ABBREVIATION
## 1 <NA>
## 2 SAS
## 3 <NA>
## 4 SAS
## 5 <NA>
## 6 <NA>
Joining the data is pretty simple, because the play by play data and the SportVu data use common event IDs. The only issue I have found is that the SportVu data may contain more event IDs (such as the ball going out of bounds) that are not found in the play by play data.
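# Drop the first row (EVENTNUM 0 in the output above)
pbp <- pbp[-1,]
# Rename EVENTNUM so it matches the event.id column in the SportVu data
colnames(pbp)[2] <- 'event.id'
# Limit the fields to join to keep the overall size manageable
pbp <- pbp %>% select(event.id, EVENTMSGTYPE, EVENTMSGACTIONTYPE, SCORE)
# Convert event.id to numeric; as.character() makes this work for factor or character columns
pbp$event.id <- as.numeric(as.character(pbp$event.id))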
all.movements <- merge(x = all.movements, y = pbp, by = "event.id", all.x = TRUE)
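To see how many SportVu events have no play by play match, a quick check along these lines can help. This is only a sketch on made-up toy data; `movements_toy` and `pbp_toy` are hypothetical stand-ins for `all.movements` and `pbp`, not the real game data.

```r
# Toy stand-ins for the real data frames; the values are invented for illustration.
movements_toy <- data.frame(
  event.id = c(1, 1, 2, 2, 3, 3),
  lastname = c("ball", "Parker", "ball", "Parker", "ball", "Parker"),
  stringsAsFactors = FALSE
)
pbp_toy <- data.frame(
  event.id = c(1, 3),
  EVENTMSGTYPE = c(10, 2)
)

# Event IDs present in the tracking data but absent from the play by play
unmatched_ids <- base::setdiff(unique(movements_toy$event.id), pbp_toy$event.id)

# Left join, as in the merge above: unmatched events keep their rows but get NA
joined_toy <- merge(x = movements_toy, y = pbp_toy, by = "event.id", all.x = TRUE)
n_unmatched_rows <- sum(is.na(joined_toy$EVENTMSGTYPE))

print(unmatched_ids)
print(n_unmatched_rows)
```

Because the merge uses `all.x = TRUE`, no tracking rows are dropped; rows from unmatched events simply carry `NA` in the play by play columns.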
Extract all data for event ID 303:
id303 <- all.movements[which(all.movements$event.id == 303),]
head(id303)
## event.id player_id lastname firstname jersey position team_id
## 1644741 303 -1 ball <NA> <NA> <NA> NA
## 1644742 303 203937 Anderson Kyle 1 F 1610612759
## 1644743 303 201937 Rubio Ricky 9 G 1610612750
## 1644744 303 201988 Mills Patty 8 G 1610612759
## 1644745 303 203952 Wiggins Andrew 22 G-F 1610612750
## 1644746 303 203937 Anderson Kyle 1 F 1610612759
## x_loc y_loc radius game_clock shot_clock quarter
## 1644741 5.43835 24.73073 10.63683 359.75 5.49 3
## 1644742 65.31054 22.12468 0.00000 346.42 19.03 3
## 1644743 46.60167 20.00475 0.00000 376.60 22.70 3
## 1644744 38.77574 21.41917 0.00000 359.40 23.71 3
## 1644745 11.18441 34.04307 0.00000 359.40 23.69 3
## 1644746 8.62043 2.05544 0.00000 364.39 7.11 3
## EVENTMSGTYPE EVENTMSGACTIONTYPE SCORE
## 1644741 1 98 67 - 48
## 1644742 1 98 67 - 48
## 1644743 1 98 67 - 48
## 1644744 1 98 67 - 48
## 1644745 1 98 67 - 48
## 1644746 1 98 67 - 48
The key here is to look at the EVENTMSGTYPE and EVENTMSGACTIONTYPE fields. These fields contain information about the type of play as well as what happened on the play. I do not have a definitive guide to these fields, but here is a starting point:
EVENTMSGTYPE:
1 - Make
2 - Miss
3 - Free Throw
4 - Rebound
5 - Out of bounds / Turnover / Steal
6 - Personal Foul
7 - Violation
8 - Substitution
9 - Timeout
10 - Jumpball
12 - Start Q1?
13 - Start Q2?

EVENTMSGACTIONTYPE:
1 - Jumpshot
2 - Lost ball Turnover
3 - ?
4 - Traveling Turnover / Off Foul
5 - Layup
7 - Dunk
10 - Free throw 1-1
11 - Free throw 1-2
12 - Free throw 2-2
40 - Out of bounds
41 - Block/Steal
42 - Driving Layup
50 - Running Dunk
52 - Alley Oop Dunk
55 - Hook Shot
57 - Driving Hook Shot
58 - Turnaround Hook Shot
66 - Jump Bank Shot
71 - Finger Roll Layup
72 - Putback Layup
108 - Cutting Dunk Shot
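If you want readable labels instead of raw codes, one option is a small lookup vector. The sketch below covers only the EVENTMSGTYPE codes from the tentative list above, so the labels are as uncertain as that list; `event_type_labels` and `label_event_type` are names made up for this example.

```r
# Partial lookup built from the tentative list above; unlisted codes map to NA.
event_type_labels <- c(
  "1" = "Make", "2" = "Miss", "3" = "Free Throw", "4" = "Rebound",
  "5" = "Out of bounds / Turnover / Steal", "6" = "Personal Foul",
  "7" = "Violation", "8" = "Substitution", "9" = "Timeout", "10" = "Jumpball"
)

# Hypothetical helper: translate numeric codes into labels by name lookup
label_event_type <- function(codes) {
  unname(event_type_labels[as.character(codes)])
}

labels_example <- label_event_type(c(1, 2, 4, 99))
print(labels_example)
```

Codes that are not in the lookup come back as `NA`, which makes it easy to spot event types that still need to be identified.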
Just to show the power of the play by play data, let's compare how far Ginobili travels on misses, makes, and rebounds.
ginobili_make <- all.movements[which(all.movements$lastname == "Ginobili" & all.movements$EVENTMSGTYPE == 1),]
ginobili_miss <- all.movements[which(all.movements$lastname == "Ginobili" & all.movements$EVENTMSGTYPE == 2),]
ginobili_rebound <- all.movements[which(all.movements$lastname == "Ginobili" & all.movements$EVENTMSGTYPE == 4),]
#Makes
travelDist(ginobili_make$x_loc, ginobili_make$y_loc)
## [1] 621.9733
#Misses
travelDist(ginobili_miss$x_loc, ginobili_miss$y_loc)
## [1] 311.2476
#Rebounds
travelDist(ginobili_rebound$x_loc, ginobili_rebound$y_loc)
## [1] 361.7619
There are many possible explanations for these numbers, but this should give you an idea of the power of the play by play data.
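The travelDist function comes from _functions.R and its implementation is not shown here. As a rough sketch, assuming it sums the straight-line distance between consecutive positions, a stand-in could look like the following. `travel_dist_sketch` is a hypothetical name, and the toy coordinates are invented. The second part sums distance within each event separately, which avoids counting the jump in position between the end of one event and the start of the next.

```r
# Hypothetical stand-in for travelDist(); the real one lives in _functions.R.
travel_dist_sketch <- function(x, y) {
  # Sum of straight-line distances between consecutive (x, y) samples
  sum(sqrt(diff(x)^2 + diff(y)^2))
}

# A 3-4-5 step followed by a 5-unit step along x
simple_total <- travel_dist_sketch(c(0, 3, 8), c(0, 4, 4))

# Toy data with two events that start far apart on the court
toy <- data.frame(
  event.id = c(1, 1, 2, 2),
  x_loc = c(0, 3, 50, 50),
  y_loc = c(0, 4, 10, 13)
)

# Distance computed over all rows at once includes the jump between events
naive_total <- travel_dist_sketch(toy$x_loc, toy$y_loc)

# Distance computed per event, then summed, leaves that jump out
per_event <- sapply(
  split(toy, toy$event.id),
  function(d) travel_dist_sketch(d$x_loc, d$y_loc)
)
per_event_total <- sum(per_event)

print(c(simple_total, naive_total, per_event_total))
```

Whether the per-event version is the better measure depends on the question being asked, so treat it as one possible refinement rather than a correction.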
Let's look at which players run the farthest on plays where there is a layup.
player_layup <- all.movements[which(all.movements$EVENTMSGACTIONTYPE == 5),]
player.groups <- group_by(player_layup, lastname)
dist.traveled.players <- summarise(player.groups, totalDist=travelDist(x_loc, y_loc),playerid = max(player_id))
arrange(dist.traveled.players, desc(totalDist))
## Source: local data frame [25 x 3]
##
## lastname totalDist playerid
## (chr) (dbl) (chr)
## 1 ball 211.7860 -1
## 2 Aldridge 193.2782 200746
## 3 Duncan 188.0446 1495
## 4 Dieng 163.8324 203476
## 5 Leonard 161.1590 202695
## 6 Jones 144.9321 1626145
## 7 LaVine 141.2286 203897
## 8 Towns 138.0612 1626157
## 9 Wiggins 132.3646 203952
## 10 Anderson 131.2474 203937
## .. ... ... ...
Let's compare this to the list of players who run the farthest when a layup is made.
player_layup <- all.movements[which(all.movements$EVENTMSGACTIONTYPE == 5 & all.movements$EVENTMSGTYPE == 1),]
player.groups <- group_by(player_layup, lastname)
dist.traveled.players <- summarise(player.groups, totalDist=travelDist(x_loc, y_loc),playerid = max(player_id))
arrange(dist.traveled.players, desc(totalDist))
## Source: local data frame [23 x 3]
##
## lastname totalDist playerid
## (chr) (dbl) (chr)
## 1 ball 125.30158 -1
## 2 Jones 110.31559 1626145
## 3 Dieng 106.61730 203476
## 4 LaVine 103.23012 203897
## 5 Muhammad 86.74623 203498
## 6 Duncan 83.20876 1495
## 7 Leonard 77.70864 202695
## 8 West 76.24240 2561
## 9 Parker 73.74633 2225
## 10 Ginobili 71.06211 1938
## .. ... ... ...
You can see that the list changes, because not every layup results in a made basket. These examples illustrate the power of using the play by play data.
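To see how many layup events were made versus missed, you can count distinct events instead of tracking rows. This is a minimal base R sketch on invented toy data; `layup_toy` is a hypothetical stand-in for the merged `all.movements` data frame, and the codes follow the tentative lists above (action type 5 for layup, event type 1 for make, 2 for miss).

```r
# Toy merged data: two rows per event, values invented for illustration
layup_toy <- data.frame(
  event.id = c(10, 10, 11, 11, 12, 12),
  EVENTMSGTYPE = c(1, 1, 2, 2, 1, 1),
  EVENTMSGACTIONTYPE = c(5, 5, 5, 5, 1, 1)
)

# Keep layup rows, then reduce to one row per event
layups <- layup_toy[layup_toy$EVENTMSGACTIONTYPE == 5, ]
layup_events <- unique(layups[, c("event.id", "EVENTMSGTYPE")])

# Count made and missed layup events
made_layups <- sum(layup_events$EVENTMSGTYPE == 1)
missed_layups <- sum(layup_events$EVENTMSGTYPE == 2)

print(c(made = made_layups, missed = missed_layups))
```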
I hope this helps people combine the SportVu data with the play by play data. I had some great help figuring all of this out. I need to credit Justin, Darrly Blackport, and Grant Fiddyment.
For more of my explorations of the NBA data, see my NBA GitHub repo. You can find more information about me, Rajiv Shah, and my other projects, or find me on Twitter.