1 Introduction

In this essay we address the problem of estimating, from data provided by the FRA in the WABS databases, the probability of a collision at specific rail crossings in a year. WABS itself makes some estimates, but does not document the method used for making them. We are not trying to replicate the WABS results, but instead develop here an independent method based on logistic regression.

Logistic regression is a way of estimating the probability of an event, such as an accident or collision, happening. The response variable is a binary variable indicating whether an accident occurs or not. Predictor variables are taken from the data fields provided by WABS. Logistic regression allows either continuous or discrete values of predictors, including factors (which might be character data in the data file). It uses a version of regression which chooses coefficients based on maximum likelihood, instead of least square error, which is used in ordinary regression. In addition, the response variable does not follow a normal distribution, but a binomial one, so that ordinary least squares methods of analysis of results are inappropriate.

2 Data

WABS Accident data is contained in pdf documents downloadable from their website. The files are not downloadable in any other format. We extracted the data from the pdf and edited the files, using Notepad+, to get columns and field names aligned properly, and then used the RStudio import data feature to get it in.

3 Organizing Data

We first create a shorter, more generic name for the data frame. A skim of the data shows some NAs in AADT, and one in HWYLNS. We have to replace these

dataf = WABS_2019_01_28
dataf %>% glimpse

## Rows: 80
## Columns: 22
## $ RANK       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ PRED_COLLS <dbl> 0.171142, 0.129295, 0.050374, 0.049383, 0.048540, 0.046391,…
## $ CROSSING   <chr> "498663J", "498566A", "498688E", "498564L", "498568N", "498…
## $ RR         <chr> "SMRT", "SMRT", "SMRT", "SMRT", "SMRT", "SMRT", "SMRT", "SM…
## $ STATE      <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA",…
## $ COUNTY     <chr> "SONOMA", "SONOMA", "SONOMA", "SONOMA", "SONOMA", "SONOMA",…
## $ CITY       <chr> "SANTAROSA", "SANTAROSA", "PETALUMA", "SANTAROSA", "SANTARO…
## $ ROAD       <chr> "HEARN_AVE", "STEELE_LANE", "WASHINGTON_STREET", "COLLEGE_A…
## $ C          <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ C1         <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ C2         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ C3         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ C4         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ DATE_CHG   <chr> "07/17", "07/17", NA, NA, NA, NA, NA, "05/17", NA, NA, NA, …
## $ WD         <chr> "GT", "GT", "GT", "GT", "GT", "GT", "FQ", "GT", "GT", "FQ",…
## $ TOT_TRN    <dbl> 64, 64, 60, 64, 64, 10, 60, 60, 64, 60, 64, 64, 64, 72, 72,…
## $ TOT_TRK    <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ TTBL_SPD   <dbl> 79, 70, 25, 60, 79, 40, 15, 45, 40, 25, 25, 79, 79, 50, 40,…
## $ HWYPVD     <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YE…
## $ HWYLNS     <dbl> 3, 3, 5, 4, 4, 2, 2, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 2, 3, 2,…
## $ AADT       <dbl> 20600, 8100, 21549, 26900, 23900, 5024, 48000, 8500, 27300,…
## $ PLUSCODE   <chr> NA, NA, NA, NA, NA, NA, NA, NA, "84CVF747+C5M5", NA, NA, NA…

HWYLNS should probably be replaced with a 2, since it is the number of highway lanes at the crossing, and could not be 0. And the AADT ones should be replaced with zeroes. We also note that HWYPVD is always YES, so it need not be considered as a discriminating response variable.

dataf$HWYLNS[which(is.na(dataf$HWYLNS) )] = 2
dataf$AADT[which(is.na(dataf$AADT))] = 0

Now the data look good for analysis. Here is a list of the column names:

names(dataf)

##  [1] "RANK"       "PRED_COLLS" "CROSSING"   "RR"         "STATE"     
##  [6] "COUNTY"     "CITY"       "ROAD"       "C"          "C1"        
## [11] "C2"         "C3"         "C4"         "DATE_CHG"   "WD"        
## [16] "TOT_TRN"    "TOT_TRK"    "TTBL_SPD"   "HWYPVD"     "HWYLNS"    
## [21] "AADT"       "PLUSCODE"

RANK and PRED_COLLS show the results of the WABS estimation. RANK is the order high to low on PRED_COLLS, PRED_COLLS is the estimate of the probability of a collision at the crossing in a year computed by WABS. The data have ranked the crossings in descending order of probability of a collision.

CROSSING, RR, STATE, COUNTY, CITY, and ROAD describe the exact location of the crossing. To create this file we selected only crossings on the SMART train line in Sonoma and Marin counties, California. Thus RR always is the acronym SMRT, STATE is always California, and COUNTY is either Sonoma or Marin. The fields beginning with C are the accident or collision counts for the current cycle (representing one year), and previous cycles of one year length. Since the SMART line only operated in 2017 and 2018, the last 3 fields here are all zeros. WD is a code for the type of equipment at the crossing. TOT_TRN is the number of trains per day, TOT_TRK the number of tracks at the crossing, TTBL_SPD the maximum speed permitted by the timetable schedule, HWYPVD a code for whether the crossing is paved or not (in this sample always YES, so it is not a candidate for a predictor). HWYLNS is the number of highway lanes going through the crossing, and AADT is the average daily vehicle traffic count at the crossing. Except for HWYPVD, the variables after DATE_CHG, which represents whether there was a change to the entry made by the railroad during the cycles included (perhaps a re-engineering of the crossing equipment), are all candidates for predictors of the collision likelihood.

Here are the key predictions about the crossings: you can search the table to see highs and lows.

dataf1 = dataf[,c(1:2,7:8,15:18, 20:21)]
dataf1

## # A tibble: 80 × 10
##     RANK PRED_COLLS CITY       ROAD  WD    TOT_TRN TOT_TRK TTBL_SPD HWYLNS  AADT
##    <dbl>      <dbl> <chr>      <chr> <chr>   <dbl>   <dbl>    <dbl>  <dbl> <dbl>
##  1     1     0.171  SANTAROSA  HEAR… GT         64       1       79      3 20600
##  2     2     0.129  SANTAROSA  STEE… GT         64       1       70      3  8100
##  3     3     0.0504 PETALUMA   WASH… GT         60       1       25      5 21549
##  4     4     0.0494 SANTAROSA  COLL… GT         64       2       60      4 26900
##  5     5     0.0485 SANTAROSA  PINE… GT         64       1       79      4 23900
##  6     6     0.0464 ROHNERTPA… GOLF… GT         10       1       40      2  5024
##  7     7     0.0458 SANRAFAEL  4TH_… FQ         60       2       15      2 48000
##  8     8     0.0455 SANRAFAEL  N_SA… GT         60       1       45      2  8500
##  9     9     0.0426 SANTAROSA  GUER… GT         64       2       40      2 27300
## 10    10     0.0424 PETALUMA   EAST… FQ         60       1       25      3 18371
## # ℹ 70 more rows

dataf1 %>% summary

##       RANK         PRED_COLLS           CITY               ROAD          
##  Min.   : 1.00   Min.   :0.000057   Length:80          Length:80         
##  1st Qu.:20.75   1st Qu.:0.008430   Class :character   Class :character  
##  Median :40.50   Median :0.015378   Mode  :character   Mode  :character  
##  Mean   :40.50   Mean   :0.022960                                        
##  3rd Qu.:60.25   3rd Qu.:0.034537                                        
##  Max.   :80.00   Max.   :0.171142                                        
##       WD               TOT_TRN         TOT_TRK         TTBL_SPD    
##  Length:80          Min.   : 0.00   Min.   :0.000   Min.   : 0.00  
##  Class :character   1st Qu.: 8.00   1st Qu.:1.000   1st Qu.:25.00  
##  Mode  :character   Median :11.00   Median :1.000   Median :40.00  
##                     Mean   :34.41   Mean   :1.238   Mean   :38.73  
##                     3rd Qu.:61.00   3rd Qu.:1.000   3rd Qu.:45.00  
##                     Max.   :72.00   Max.   :3.000   Max.   :79.00  
##      HWYLNS           AADT      
##  Min.   :0.000   Min.   :    0  
##  1st Qu.:2.000   1st Qu.:  700  
##  Median :2.000   Median : 2600  
##  Mean   :2.163   Mean   : 6512  
##  3rd Qu.:2.000   3rd Qu.: 8275  
##  Max.   :5.000   Max.   :48000

Only 2 have higher predictions than 10%, and the last 10 are all under 0.4%. The last 5 are pedestrian crossings.

4 Data preview

To gain some insight into the data we plot the predicted collisions against the six chosen variables.

chosen = c(2, 16:18, 20:21)
ggpairs(dataf[,chosen], progress=F)

We see there are a couple of outliers with very high PRED_COLLS.

dataf[,chosen] %>% filter(PRED_COLLS>=.10)

## # A tibble: 2 × 6
##   PRED_COLLS TOT_TRN TOT_TRK TTBL_SPD HWYLNS  AADT
##        <dbl>   <dbl>   <dbl>    <dbl>  <dbl> <dbl>
## 1      0.171      64       1       79      3 20600
## 2      0.129      64       1       70      3  8100

These occur with high TOT_TRN, TOT_TRK = 2, high TTBL_SPD, and 3 HWYLNS; AADT for these is medium in one case and high in another. Otherwise there is not a clear pattern.

dataf[,chosen] %>% filter(PRED_COLLS<=.01)

## # A tibble: 25 × 6
##    PRED_COLLS TOT_TRN TOT_TRK TTBL_SPD HWYLNS  AADT
##         <dbl>   <dbl>   <dbl>    <dbl>  <dbl> <dbl>
##  1    0.00991      10       1       40      2  1000
##  2    0.00964      10       1       35      2   900
##  3    0.00934      10       1       40      2   800
##  4    0.00908       8       1       40      2   900
##  5    0.00849       8       1       40      2   700
##  6    0.00824      10       1       40      2   500
##  7    0.00815       8       1       40      2   600
##  8    0.00812      60       1       45      2   100
##  9    0.00776      12       2       20      2   300
## 10    0.00766       5       1       10      2  1104
## # ℹ 15 more rows

Very low PRED_COLLS go with low values of AADT,TTBL_SPD, and TOT_TRN.

5 Logistic regression analysis

A preliminary analysis was done using logistic regression; we used the glm routine in R. We tried a model for C using all the predictors.

glm.fit1=glm(C~as.factor(WD)+TOT_TRN+TOT_TRK+TTBL_SPD+HWYLNS+AADT, data=dataf,family="binomial")
glm.fit1 %>% glance()

## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          25.6      79  -9.06  42.1  70.7     18.1          68    80

glm.fit1 %>% tidy %>% gt_add_significance()

term	estimate	std.error	statistic	p.value
(Intercept)	−17.2769	7,279.2429	−0.0024	0.9981
as.factor(WD)FQ	−3.5339	8,758.4122	−0.0004	0.9997
as.factor(WD)GT	12.6775	7,279.2365	0.0017	0.9986
as.factor(WD)HS	−2.4090	12,306.0119	−0.0002	0.9998
as.factor(WD)NO	−3.6173	14,497.2508	−0.0002	0.9998
as.factor(WD)SS	−0.2371	19,166.4623	0.0000	1.0000
as.factor(WD)XB	−0.4972	19,166.4623	0.0000	1.0000
TOT_TRN	−0.1433	0.1433	−1.0002	0.3172
TOT_TRK	−3.9823	11.1881	−0.3559	0.7219
TTBL_SPD	0.2057	0.1780	1.1555	0.2479
HWYLNS	−0.4243	1.2349	−0.3436	0.7312
AADT	0.0001	0.0001	0.8817	0.3779

A smaller Akaike Information Criterion (AIC) would represent a better fit; one cannot use \(R^2\), the conventional least squares regression measure of fit.

Interpreting the coefficients, we see that none of the WD crossing type factors contributed anything to the result (p close to 1). TOT_TRK and HWYLNS were also very likely (p=.772, p=.731) to be unrelated. Of the remaining 3 predictors, TTBL_SPD (timetable train speed) was the most likely to have an effect, and the effect is positive on the probability– the coefficient has a positive sign. The number of trains TOT_TRN is next most effective, in a negative direction– more trains through a crossing make the probability of a collision lower. AADT, the number or autos per day through the crossing, is third, with again a positive effect– more would predict a greater probability of collision – however the coefficient is almost 0.

As a check we can use the caret package to assess the importance of the variables.

library(caret); library(dplyr)
round(varImp(glm.fit1,scale=TRUE),4)

##                 Overall
## as.factor(WD)FQ  0.0004
## as.factor(WD)GT  0.0017
## as.factor(WD)HS  0.0002
## as.factor(WD)NO  0.0002
## as.factor(WD)SS  0.0000
## as.factor(WD)XB  0.0000
## TOT_TRN          1.0002
## TOT_TRK          0.3559
## TTBL_SPD         1.1555
## HWYLNS           0.3436
## AADT             0.8817

Clearly TTBL_SPD is most important; TOT_TRN is close behind; AADT is third; and TOT_TRK and HWYLNS are quite a bit less important.

6 Reduced form of logistic regression

It therefore makes sense to rerun the logistic regression, without the WD factors and TOT_TRK, HWYLNS, to see if a better result is obtained.

glm.fit3=glm(C~TOT_TRN+TTBL_SPD+AADT, data=dataf, family="binomial")
glm.fit3 %>% glance

## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          25.6      79  -9.46  26.9  36.5     18.9          76    80

glm.fit3 %>% tidy %>% gt_add_significance()

term	estimate	std.error	statistic	p.value
(Intercept)	−10.3912	6.1513	−1.6893	0.0912
TOT_TRN	−0.1690	0.1464	−1.1546	0.2482
TTBL_SPD	0.2316	0.1831	1.2649	0.2059
AADT	0.0001	0.0001	0.9104	0.3626

Now we see all predictor coefficients have relatively low p scores– below .37. The coefficients all go in the expected direction, if you consider that when there are lots of trains through a crossing each day, people tend to be more careful at the intersection.

7 Predicting probabilities of collision

We can use the results of the latter logistic regression to predict the probability of a collision at each crossing. Our results will be different from those of the WABS. We build a report to show the results for the crossings. The report will show city and road, type of equipment at the crossing, the predicted probability of collision, and the standard error of the prediction, which gives an idea of how confident we are in the prediction. it could be used to calculate confidence intervals on the predictions. We marked with a * the crossings at which the standard error was small enough so that Prediction was greater than SE, and hence one standard error below the Prediction was still positive probability of collision.

##           City                     Road Type Prediction  SEfit sig
## 5    SANTAROSA               PINER_ROAD   GT    0.30700 0.2876   *
## 16    PETALUMA              CORONA_ROAD   GT    0.26425 0.2190   *
## 1    SANTAROSA                HEARN_AVE   GT    0.24901 0.2239   *
## 71      FULTON            AVIATION_BLVD   GT    0.19704 0.2095    
## 40      FULTON               RIVER_ROAD   GT    0.14239 0.1141   *
## 37      COTATI          EAST_COTATI_AVE   GT    0.10823 0.0774   *
## 38 ROHNERTPARK           SOUTHWEST_BLVD   GT    0.09945 0.0686   *
## 48     WINDSOR         WINDSOR_RIVER_RD   GT    0.09902 0.0721   *
## 53      FULTON              FULTON_ROAD   GT    0.09146 0.0675   *
## 22   SANTAROSA       BELLEVUE/CORBY_AVE   FQ    0.08986 0.1151    
## 6  ROHNERTPARK        GOLF_COURSE_DRIVE   GT    0.08493 0.0571   *
## 59     WINDSOR               STARR_ROAD   GT    0.08308 0.0634   *
## 60      FULTON              SHILOH_ROAD   GT    0.08175 0.0628   *
## 62      FULTON             AIRPORT_BLVD   GT    0.08109 0.0626   *
## 69     WINDSOR            MITCHELL_LANE   GT    0.07851 0.0615   *
## 31      SONOMA         SEARSPOINT/ST-37   GT    0.07651 0.1969    
## 12   SANTAROSA          SAN_MIGUEL_ROAD   GT    0.07650 0.1084    
## 13   SANTAROSA               BARHAM_AVE   GT    0.07406 0.1071    
## 39 ROHNERTPARK        ROHNERT_PARK_EXPY   GT    0.06865 0.0492   *
## 54  HEALDSBURG            BAILHACHE_AVE   GT    0.06430 0.0478   *
## 55      COTATI        EAST_RAILROAD_AVE   GT    0.06325 0.0476   *
## 56   SANTAROSA          WEST_ROBLES_AVE   GT    0.06120 0.0470   *
## 17   SANTAROSA                TODD_ROAD   GT    0.06070 0.0469   *
## 58 ROHNERTPARK               SCENIC_AVE   GT    0.06020 0.0468   *
## 61  HEALDSBURG            LIMERICK_LANE   GT    0.05873 0.0465   *
## 68  HEALDSBURG               GRANT_AVE.   GT    0.05729 0.0462   *
## 28    PETALUMA           NMCDOWELL_BLVD   GT    0.05547 0.0573    
## 47   PENNGROVE         MAIN_ST/WOODWARD   GT    0.02283 0.0278    
## 50   PENNGROVE           OLD_ADOBE_ROAD   GT    0.02168 0.0271    
## 57    PETALUMA               ELY_ROAD_N   GT    0.01990 0.0261    
## 43   SANRAFAEL               2ND_STREET   GT    0.01534 0.0399    
## 2    SANTAROSA              STEELE_LANE   GT    0.01358 0.0323    
## 74  HEALDSBURG        GRANT_SCHOOL_ROAD   NO    0.01011 0.0175    
## 75    PETALUMA         SOUTH_POINT_ROAD   NO    0.01011 0.0175    
## 52   SANRAFAEL               3RD_STREET   GT    0.00828 0.0209    
## 51      SONOMA          SR_ROUTE_12/121   GT    0.00765 0.0154    
## 41  HEALDSBURG           HEALDSBURG_AVE   GT    0.00742 0.0158    
## 4    SANTAROSA           COLLEGE_AVENUE   GT    0.00703 0.0230    
## 30   SANRAFAEL              IRWINSTREET   HS    0.00378 0.0101    
## 70      SONOMA        TOULAY_CREEK_ROAD   HS    0.00369 0.0092    
## 45      SONOMA          SR_ROUTE_12/121   GT    0.00273 0.0078    
## 46  HEALDSBURG             FRONT_STREET   GT    0.00234 0.0068    
## 27      NOVATO             ROBLAR_DRIVE   GT    0.00151 0.0059    
## 67      SONOMA         SKAGGS_ISLAND_RD   GT    0.00144 0.0046    
## 42      NOVATO            HAMILTON_PKWY   GT    0.00137 0.0055    
## 79      NOVATO            BIKE_PATH_PED   GT    0.00131 0.0053    
## 23   SANTAROSA          SEBASTOPOL_ROAD   GT    0.00114 0.0048    
## 64    PETALUMA            HOPPER_STREET   FL    0.00043 0.0018    
## 21    PETALUMA          W_PAYRAN_STREET   GT    0.00029 0.0015    
## 49   SANRAFAEL           FRANCISCO_BLVD   XB    0.00029 0.0013    
## 66   SANRAFAEL              RICE_STREET   FL    0.00029 0.0013    
## 24   SANRAFAEL           SMITH_RANCH_RD   GT    0.00023 0.0012    
## 72   SANTAROSA             STANDISH_AVE   SS    0.00022 0.0010    
## 44      NOVATO           HANNA_RANCH_RD   GT    0.00017 0.0010    
## 77      NOVATO    Franklin_Ped_crossing   FL    0.00017 0.0010    
## 65        NAPA              MILTON_ROAD   GT    0.00015 0.0007    
## 76      NOVATO                 PED_XING   HS    0.00013 0.0007    
## 8    SANRAFAEL          N_SANPEDRO_ROAD   GT    0.00009 0.0005    
## 9    SANTAROSA         GUERNEVILLE_ROAD   GT    0.00007 0.0005    
## 25    PETALUMA         LAKEVILLE_STREET   GT    0.00007 0.0004    
## 14      NOVATO                GRANT_AVE   GT    0.00004 0.0003    
## 32   SANRAFAEL           PACHECO_STREET   GT    0.00004 0.0002    
## 63   SANRAFAEL                 PED_XING   FL    0.00004 0.0003    
## 73   SANRAFAEL            PALOMA_AVENUE   GT    0.00004 0.0003    
## 26   SANTAROSA               9TH_STREET   FQ    0.00003 0.0002    
## 80 ROHNERTPARK                 PED_XING   GT    0.00003 0.0002    
## 29      NOVATO          RUSHCREEK_PLACE   FQ    0.00002 0.0001    
## 3     PETALUMA        WASHINGTON_STREET   GT    0.00000 0.0000    
## 7    SANRAFAEL               4TH_STREET   FQ    0.00000 0.0000    
## 10    PETALUMA            EAST_D_STREET   FQ    0.00000 0.0000    
## 11   SANTAROSA               3RD_STREET   GT    0.00000 0.0000    
## 15      NOVATO                OLIVE_AVE   GT    0.00000 0.0000    
## 18   SANRAFAEL           MISSION_AVENUE   FQ    0.00000 0.0000    
## 19   SANRAFAEL          CIVIC_CENTER_DR   GT    0.00000 0.0000    
## 20   SANRAFAEL               5TH_AVENUE   FQ    0.00000 0.0000    
## 33   SANTAROSA               8TH_STREET   GT    0.00000 0.0000    
## 34   SANTAROSA               7TH_STREET   GT    0.00000 0.0000    
## 35   SANTAROSA               6TH_STREET   GT    0.00000 0.0000    
## 36    PETALUMA           CAULFIELD_LANE   GT    0.00000 0.0000    
## 78   SANTAROSA Railroad_square_ped_xing   FL    0.00000 0.0000

Let us select all the data with a Prediction (probability of collison) greater than 1 in 100 (0.01).

d %>% filter(Prediction>.01)

##           City               Road Type Prediction  SEfit sig
## 5    SANTAROSA         PINER_ROAD   GT    0.30700 0.2876   *
## 16    PETALUMA        CORONA_ROAD   GT    0.26425 0.2190   *
## 1    SANTAROSA          HEARN_AVE   GT    0.24901 0.2239   *
## 71      FULTON      AVIATION_BLVD   GT    0.19704 0.2095    
## 40      FULTON         RIVER_ROAD   GT    0.14239 0.1141   *
## 37      COTATI    EAST_COTATI_AVE   GT    0.10823 0.0774   *
## 38 ROHNERTPARK     SOUTHWEST_BLVD   GT    0.09945 0.0686   *
## 48     WINDSOR   WINDSOR_RIVER_RD   GT    0.09902 0.0721   *
## 53      FULTON        FULTON_ROAD   GT    0.09146 0.0675   *
## 22   SANTAROSA BELLEVUE/CORBY_AVE   FQ    0.08986 0.1151    
## 6  ROHNERTPARK  GOLF_COURSE_DRIVE   GT    0.08493 0.0571   *
## 59     WINDSOR         STARR_ROAD   GT    0.08308 0.0634   *
## 60      FULTON        SHILOH_ROAD   GT    0.08175 0.0628   *
## 62      FULTON       AIRPORT_BLVD   GT    0.08109 0.0626   *
## 69     WINDSOR      MITCHELL_LANE   GT    0.07851 0.0615   *
## 31      SONOMA   SEARSPOINT/ST-37   GT    0.07651 0.1969    
## 12   SANTAROSA    SAN_MIGUEL_ROAD   GT    0.07650 0.1084    
## 13   SANTAROSA         BARHAM_AVE   GT    0.07406 0.1071    
## 39 ROHNERTPARK  ROHNERT_PARK_EXPY   GT    0.06865 0.0492   *
## 54  HEALDSBURG      BAILHACHE_AVE   GT    0.06430 0.0478   *
## 55      COTATI  EAST_RAILROAD_AVE   GT    0.06325 0.0476   *
## 56   SANTAROSA    WEST_ROBLES_AVE   GT    0.06120 0.0470   *
## 17   SANTAROSA          TODD_ROAD   GT    0.06070 0.0469   *
## 58 ROHNERTPARK         SCENIC_AVE   GT    0.06020 0.0468   *
## 61  HEALDSBURG      LIMERICK_LANE   GT    0.05873 0.0465   *
## 68  HEALDSBURG         GRANT_AVE.   GT    0.05729 0.0462   *
## 28    PETALUMA     NMCDOWELL_BLVD   GT    0.05547 0.0573    
## 47   PENNGROVE   MAIN_ST/WOODWARD   GT    0.02283 0.0278    
## 50   PENNGROVE     OLD_ADOBE_ROAD   GT    0.02168 0.0271    
## 57    PETALUMA         ELY_ROAD_N   GT    0.01990 0.0261    
## 43   SANRAFAEL         2ND_STREET   GT    0.01534 0.0399    
## 2    SANTAROSA        STEELE_LANE   GT    0.01358 0.0323    
## 74  HEALDSBURG  GRANT_SCHOOL_ROAD   NO    0.01011 0.0175    
## 75    PETALUMA   SOUTH_POINT_ROAD   NO    0.01011 0.0175

The table shows 34 crossings have probabilities of a collision in the next year greater than 0.01 or 1 in 100.

If we look at the lowest ones, we will screen for the ones with a Prediction less than .0001, or one in ten thousand. We obtain 23 crossings with less than 1 in 10000.

d %>% filter(Prediction < 0.0001)

##           City                     Road Type Prediction SEfit sig
## 8    SANRAFAEL          N_SANPEDRO_ROAD   GT      9e-05 5e-04    
## 9    SANTAROSA         GUERNEVILLE_ROAD   GT      7e-05 5e-04    
## 25    PETALUMA         LAKEVILLE_STREET   GT      7e-05 4e-04    
## 14      NOVATO                GRANT_AVE   GT      4e-05 3e-04    
## 32   SANRAFAEL           PACHECO_STREET   GT      4e-05 2e-04    
## 63   SANRAFAEL                 PED_XING   FL      4e-05 3e-04    
## 73   SANRAFAEL            PALOMA_AVENUE   GT      4e-05 3e-04    
## 26   SANTAROSA               9TH_STREET   FQ      3e-05 2e-04    
## 80 ROHNERTPARK                 PED_XING   GT      3e-05 2e-04    
## 29      NOVATO          RUSHCREEK_PLACE   FQ      2e-05 1e-04    
## 3     PETALUMA        WASHINGTON_STREET   GT      0e+00 0e+00    
## 7    SANRAFAEL               4TH_STREET   FQ      0e+00 0e+00    
## 10    PETALUMA            EAST_D_STREET   FQ      0e+00 0e+00    
## 11   SANTAROSA               3RD_STREET   GT      0e+00 0e+00    
## 15      NOVATO                OLIVE_AVE   GT      0e+00 0e+00    
## 18   SANRAFAEL           MISSION_AVENUE   FQ      0e+00 0e+00    
## 19   SANRAFAEL          CIVIC_CENTER_DR   GT      0e+00 0e+00    
## 20   SANRAFAEL               5TH_AVENUE   FQ      0e+00 0e+00    
## 33   SANTAROSA               8TH_STREET   GT      0e+00 0e+00    
## 34   SANTAROSA               7TH_STREET   GT      0e+00 0e+00    
## 35   SANTAROSA               6TH_STREET   GT      0e+00 0e+00    
## 36    PETALUMA           CAULFIELD_LANE   GT      0e+00 0e+00    
## 78   SANTAROSA Railroad_square_ped_xing   FL      0e+00 0e+00

8 Importance of variables

Which variable is most important? The package caret gives us a method for deciding. We see below that TTBL_SPD is a bit more important than TOT_TRN, and AADT is less important. But they are not greatly different in importance.

library(caret)
caret::varImp(glm.fit3,scale=TRUE)

##            Overall
## TOT_TRN  1.1546445
## TTBL_SPD 1.2648867
## AADT     0.9104228

9 Residuals

Residual analysis is important as well, to see if we have a decent model for prediction. We call this candidate model LOGR.

glm.fit3_data = data.frame( 
  RANK=dataf$RANK, 
  name=paste0(dataf$CITY, "; ", 
              substr(dataf$ROAD,1,20) ),
  std_resids = scale(glm.fit3$residuals),  
  C = dataf$C, 
  L=glm.fit3$fitted.values, 
  W=dataf$PRED_COLLS )

ggplot(glm.fit3_data, aes( RANK, std_resids, color=factor(C) ) )  + 
  geom_point(alpha = .5) +
  geom_hline(yintercept = 3, color = "orange") + 
  geom_hline( yintercept=0, color = "blue") +
  labs( x = "RANK by WABS probability", y="Standardized residual", 
        title= "Standardized residuals of crossings using LOGR regression", 
        subtitle = "Crossings with collisons have large positive standard errors, others negative standard errors." )

As the above graph and table following show, only 3 crossings have residuals over 0, and these are the ones with collisions actually occurring. Which are these?

glm.fit3_data %>% filter ( std_resids  > 0 )

##    RANK                           name std_resids C          L        W
## 1     1           SANTAROSA; HEARN_AVE   1.868924 1 0.24901172 0.171142
## 6     6 ROHNERTPARK; GOLF_COURSE_DRIVE   5.012333 1 0.08492718 0.046391
## 17   17           SANTAROSA; TODD_ROAD   6.916577 1 0.06069780 0.036735

10 Comparing with the WABS predictions

We will now compare our predictions from LOGR with the WABS predictions in the field PRED_COLLS. The easiest way to do this is via a graph, plotting the PRED against PRED_COLLS.

The graph shows that for the most part LOGR predicts a higher probability than WABS, particularly when the probability is over about .025, or 2.5%.

Let us then exclude all the ones with WABS predicted collision likelihood less than .02, and the two with WABS prediction above 0.1. This will show us the graph in a more detailed scale. Even in this smaller group it is clear that the LOGR method is obtaining a higher probability than the WABS predictions.

Comparisons of WABS and LOGR predictions

11 Predictions using interaction terms

Because of this upward bias in the LOGR predictions, we now revise the LOGA model to include only the interaction between TTBL_SPD and AADT. AIC is now negative, signifying a better fit. We first show a histogram of the distribution of prediction gap between LOGA and WABS. There are only 6 outliers with more absolute gap than 0.05.

glm.spdxaadt = glm(glm(C~TTBL_SPD:AADT, data=dataf, family="binomial"))  #+TTBL_SPD+AADT
s1=summary(glm.spdxaadt)
newdata = data.frame( HWYLNS= dataf$HWYLNS,TOT_TRN=dataf$TOT_TRN, TTBL_SPD=dataf$TTBL_SPD, AADT=dataf$AADT)
spdxaadt.pred=predict(glm.spdxaadt, newdata, type="response", se.fit=TRUE)
glm.spdxaadt_data = data.frame( RANK=dataf$RANK, 
                                name=paste0(dataf$CITY,
                                            "; ", 
                                            substr(dataf$ROAD,1,20) ),
                                std_resids = round(scale(glm.spdxaadt$residuals),4),  
                                C = dataf$C,
                                L=round(glm.spdxaadt$fitted.values,4),
                                W=round(dataf$PRED_COLLS,4) ) %>% 
  mutate ( gap=round(L-W, 4) )
ggplot(glm.spdxaadt_data, aes(gap)) +
  geom_histogram (color="red", alpha=0.2, size=1) +
  coord_cartesian( ylim= c(0,30), xlim=c(-0.05, 0.15)) +
  geom_vline ( xintercept = 0) +
  labs(title=paste0("Distribution of predictions from LOGA with TTBL_SPD:AADT \n AIC = ", round(s1$aic,3) ) )

We display the big 6 in descending order of the gap between LOGA and WABS (LOGA - WABS). Only one gap is negative.

glm.spdxaadt_data %>% filter ( !between (gap, -.05,.05) ) %>% 
  arrange ( desc(abs(gap) ) ) %>% gt

RANK	name	std_resids	L	W	gap
5	SANTAROSA; PINER_ROAD	-1.0467	0.1964	0.0485	0.1479
4	SANTAROSA; COLLEGE_AVENUE	-0.9047	0.1697	0.0494	0.1203
16	PETALUMA; CORONA_ROAD	-0.6337	0.1189	0.0371	0.0818
31	SONOMA; SEARSPOINT/ST-37	-0.5296	0.0994	0.0212	0.0782
9	SANTAROSA; GUERNEVILLE_ROAD	-0.6343	0.1190	0.0426	0.0764
2	SANTAROSA; STEELE_LANE	-0.3623	0.0680	0.1293	-0.0613

12 Improving on LOGA

We try adding TOT_TRN into LOGA. the new AIC is -34.92129, not as good as the previous try. Now there are 10 that miss by more than 0.05. So we reject this version.

glm.spdxaadttrn = 
  glm(C~TOT_TRN + TTBL_SPD:AADT, data=dataf, family="binomial")  #+TTBL_SPD+AADT+TOT_TRN
s2 =summary(glm.spdxaadttrn)
newdata = data.frame( 
  HWYLNS= dataf$HWYLNS,
  TOT_TRN=dataf$TOT_TRN, 
  TTBL_SPD=dataf$TTBL_SPD, 
  AADT=dataf$AADT)
spdxaadttrn.pred=predict(glm.spdxaadttrn, newdata, type="response", se.fit=TRUE)
glm.spdxaadttrn_data = data.frame( 
  RANK=dataf$RANK, 
  name=paste0(dataf$CITY,"; ", substr(dataf$ROAD,1,16) ),
  std_resids = round(scale(glm.spdxaadttrn$residuals),4),  
  C = dataf$C, 
  L=round(glm.spdxaadttrn$fitted.values,4), 
  W=round(dataf$PRED_COLLS,4) 
  ) %>% 
  mutate ( gap=round(L-W, 4) )
glm.spdxaadttrn_data %>% filter ( !between (gap, -.05,.05) ) %>% arrange ( desc(abs(gap) ) )

##    RANK                      name std_resids C      L      W     gap
## 5     5     SANTAROSA; PINER_ROAD    -0.2562 0 0.2998 0.0485  0.2513
## 31   31  SONOMA; SEARSPOINT/ST-37    -0.2228 0 0.2146 0.0212  0.1934
## 4     4 SANTAROSA; COLLEGE_AVENUE    -0.2119 0 0.1824 0.0494  0.1330
## 2     2    SANTAROSA; STEELE_LANE    -0.1677 0 0.0182 0.1293 -0.1111
## 43   43     SANRAFAEL; 2ND_STREET    -0.1898 0 0.1076 0.0141  0.0935
## 52   52     SANRAFAEL; 3RD_STREET    -0.1823 0 0.0792 0.0115  0.0677
## 28   28  PETALUMA; NMCDOWELL_BLVD    -0.1821 0 0.0784 0.0236  0.0548

Now there are 10 above 0.05. This is something of a fit but still is skewed right and bimodal. Personally I’d prefer the one with just the two variable interaction. We plot the histogram below.

ggplot(glm.spdxaadttrn_data, aes(gap)) +
  geom_histogram (color="red", alpha=0.2, size=1) +
  coord_cartesian( ylim= c(0,30), xlim=c(-0.05, 0.20)) +
  geom_vline ( xintercept = 0)+
  labs(title=paste0(
    "Distribution of predictions from LOGA with TOT_TRN + TTBL_SPD:AADT \n AIC = ",
    round(s2$aic,3) ) )

13 Extending model with `TOT_TRN`

Now we add in HWYLNS as a factor with TOT_TRN. We now obtain 17 gaps outside the (-0.05,0.05) interval.

glm.spdxaadtln = glm(C~as.factor(HWYLNS) + TOT_TRN + TTBL_SPD:AADT, 
                     data=dataf, 
                     family="binomial")  #+TTBL_SPD+AADT+TOT_TRN
s3=summary(glm.spdxaadtln)
newdata = data.frame( HWYLNS= dataf$HWYLNS,TOT_TRN=dataf$TOT_TRN, TTBL_SPD=dataf$TTBL_SPD, AADT=dataf$AADT)
spdxaadtln.pred=predict(glm.spdxaadtln, newdata, type="response", se.fit=TRUE)
glm.spdxaadtln_data = data.frame( RANK=dataf$RANK, 
                                  name=paste0(dataf$CITY, "; ", substr(dataf$ROAD,1,20) ),
                                  std_resids = round(scale(glm.spdxaadtln$residuals),4),  
                                  C = dataf$C,
                                  L=round(glm.spdxaadtln$fitted.values,4),
                                  W=round(dataf$PRED_COLLS,4) ) %>%
  mutate ( gap=round(L-W, 4) )

glm.spdxaadtln_data %>% filter ( !between (gap, -.05,.05) ) %>%
  arrange ( desc(abs(gap) ) ) %>% gt

RANK	name	std_resids	C	L	W	gap
1	SANTAROSA; HEARN_AVE	0.3681	1	0.5813	0.1711	0.4102
31	SONOMA; SEARSPOINT/ST-37	-0.2332	0	0.2894	0.0212	0.2682
43	SANRAFAEL; 2ND_STREET	-0.2271	0	0.2731	0.0141	0.2590
2	SANTAROSA; STEELE_LANE	-0.1655	0	0.0527	0.1293	-0.0766
52	SANRAFAEL; 3RD_STREET	-0.1725	0	0.0840	0.0115	0.0725
16	PETALUMA; CORONA_ROAD	-0.1774	0	0.1051	0.0371	0.0680
9	SANTAROSA; GUERNEVILLE_ROAD	-0.1746	0	0.0932	0.0426	0.0506
3	PETALUMA; WASHINGTON_STREET	-0.1549	0	0.0000	0.0504	-0.0504

The distribution of gaps is shown here.

The anova table indicates that the interaction of TTBL_SPD and AADT makes predictions whose error is not significantly different from that obtained by adding HWYLNS and TOT_TRN. We should therefore stick with the interaction only model. Six crossings bear closer investigation, since the predictions of the LOGR model are widely different from those of the WABS model. Adding HWYLNS as a factor does not improve the regression.

anova(glm.spdxaadtln, glm.spdxaadt, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: C ~ as.factor(HWYLNS) + TOT_TRN + TTBL_SPD:AADT
## Model 2: C ~ TTBL_SPD:AADT
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        73    19.7074                     
## 2        78     2.7802 -5   16.927

14 Particular crossings

The simpler model is clearly better. So we fix on LOGA using TTBL_SPD:AADT only. This raises the question of why these crossings are predicted differently by the LOGR model chosen. Only one of these, Steele Lane, is predicted lower by LOGR than WABS.

wh=glm.spdxaadt_data[which(abs(glm.spdxaadt_data$gap) >= 0.05),"RANK"]
mutate (dataf, L = glm.spdxaadt_data$L, 
                  gap = glm.spdxaadt_data$gap  )[ wh, ] %>%
  mutate ( name = substr(paste0(CITY, "; ",substr(ROAD, 1, 16)),1,30), W = round(PRED_COLLS,4) ) %>%
  select ( RANK, name, W, L, gap, C, TTBL_SPD, AADT, TOT_TRN, HWYLNS, TOT_TRK ) %>%
  arrange (desc ( gap)) %>% gt

RANK	name	W	L	gap	TTBL_SPD	AADT	TOT_TRN	HWYLNS	TOT_TRK
5	SANTAROSA; PINER_ROAD	0.0485	0.1964	0.1479	79	23900	64	4	1
4	SANTAROSA; COLLEGE_AVENUE	0.0494	0.1697	0.1203	60	26900	64	4	2
16	PETALUMA; CORONA_ROAD	0.0371	0.1189	0.0818	79	13810	60	2	1
31	SONOMA; SEARSPOINT/ST-37	0.0212	0.0994	0.0782	25	35600	6	2	1
9	SANTAROSA; GUERNEVILLE_ROAD	0.0426	0.1190	0.0764	40	27300	64	2	2
2	SANTAROSA; STEELE_LANE	0.1293	0.0680	-0.0613	70	8100	64	3	1

We note that Sears Point and St 37 crossing actually has 4 lanes of traffic, as a look at Google Maps shows. It seems that the somewhat higher probability imputed by L might be justified, although we cannot support inclusion of HWYLNS in the LOGR model.

Similarly Guerneville Rd has 4 highway lanes, though the database shows only 2. It has lower TTBL_SPD than College Ave, but higher AADT by a little. This would make it most similar to College Ave. We would expect a prediction higher for College Ave, and both methods get that, though the WABS difference is minimal.

15 Confidence intervals for predicted probabilities

We calculate 95% confidence intervals for the predictions from the interactions model to show how precise the predictions should be thought. We would expect that the WABS predictions should be within the 95% confidence interval for the LOGR predictions. Otherwise the WABS prediction could be anomalous. We do not have information on confidence intervals for the WABS predictions to see if they overlap those for LOGR.

pr = predict(glm.spdxaadt, type="response", se.fit=TRUE)
confint = round(data.frame(W = dataf$PRED_COLLS,
                           L = pr$fit, 
                           gap=glm.spdxaadt_data$gap,
                           lwr= pr$fit - 1.96*pr$se.fit, 
                           upr = pr$fit + 1.96 * pr$se.fit),4 
                ) %>% 
  mutate ( clen = upr - lwr, 
           rank=dataf$RANK, 
           name=paste0(dataf$CITY,"; ",substr(dataf$ROAD,1,20) ) )
confint[wh,]

##         W      L     gap    lwr    upr   clen rank                        name
## 2  0.1293 0.0680 -0.0613 0.0141 0.1218 0.1077    2      SANTAROSA; STEELE_LANE
## 4  0.0494 0.1697  0.1203 0.0147 0.3247 0.3100    4   SANTAROSA; COLLEGE_AVENUE
## 5  0.0485 0.1964  0.1479 0.0122 0.3806 0.3684    5       SANTAROSA; PINER_ROAD
## 9  0.0426 0.1190  0.0764 0.0181 0.2199 0.2018    9 SANTAROSA; GUERNEVILLE_ROAD
## 16 0.0371 0.1189  0.0818 0.0181 0.2197 0.2016   16       PETALUMA; CORONA_ROAD
## 31 0.0212 0.0994  0.0782 0.0181 0.1806 0.1625   31    SONOMA; SEARSPOINT/ST-37

Only for Steele Lane does the WABS prediction fall outside the 95% confidence limits for the L prediction. What is special about Steele Lane? A picture of the intersection from Google Maps is shown here: Picture .

While there are only 3 lanes of traffic, the inner lane is a left turn lane into Coffey which crosses the tracks. This perhaps encourages cars waiting for the left turn, and not wanting to miss the red light cycle at Coffee, to chance waiting in the crossing for the turn. With train timetable speeds of 70 here, and no knowledge of the light cycle, trains might not be able to stop; drivers, not knowing this, would be taken unawares by a train that could not stop. The confidence limits give us additional support for the LOGR model, though not necessarily in preference to the WABS model.

16 Conclusions

We have independently modeled the WABS data for the cycle ending in November 2018, and have obtained a logistic regression model. It predicts the probability of a collision at each crossing during a one year period by using two of the predictors in the WABS data set, TTBL_SPD and AADT. These are the timetable speed of the scheduled trains, and the average auto daily traffic at the crossing. The variables TOT_TRN and HWYLNS, the total number of trains per day, and the number of highway lanes at the crossing were found not to improve the model, as was also true of the discrete crossing type variable WD.

The model obtained regresses C upon the interaction term TTBL_SPD*AADT alone. Thus there is a complex relation between the predictors.

This model which we call the LOGR model predicts probabilities of collision at all 80 intersections. Only 6 crossings have predicted probabilities greater than 0.05 or 1 in 20 chance that observing the crossing for a year will detect a collision.

An analysis of these crossings shows that of the 6, two have erroneous data in the HWYLNS field in WABS data. This was discovered by inspecting the crossing via Google Maps interactive view.

However these errata would not impact the predictions made by the LOGA model. We do not know whether the WABS predictions took the erroneous values into account.

17 Predicting the number of accidents in a year.

Assume that at crossings accident rates are independent; this is likely on a long linear line such as SMART. In that case we might add the collision probabilities to obtain the number of accidents in a year for the whole line.

The WABS method predicts almost 1.84 accidents per year; the LOGR method predicts 3 accidents per year. Last year (2018) there were 3 accidents at crossings on SMART. We could predict that there would be 3 again this year, if no changes are made at crossings.

 sapply (list(glm.spdxaadt_data$W, glm.spdxaadt_data$L), sum)

## [1] 1.8368 3.0005

18 Acknowledgements

The author thanks Steve Birdlebough for calling this interesting problem and the WABS dataset to his attention.

19 References

[1] WABS dataset

20 Appendix A Univariate logistic regressions

First we try the most significant variable,TTBL_SPD. A plot of the distribution of the gaps will show the range. These gaps appear distributed bimodal and skewed right(?) which means slightly higher estimates than WABS. But it is a pretty good result with no gap larger than about .075 in either direction.

Next we try TOT_TRN, the next most significant variable. We plot the distribution of gaps.

This variable skews gaps to the right, estimating in excess of the WABS estimates. But a look at the predicted values shows a certain amount of discretization, since the TOT_TRN predictor has only a few discrete values. This makes it not a good predictor to include.

Finally we try AADT on its own. We plot the distribution of the gaps.

Again there are relatively few values of the prediction, they tend to be low, and the busy intersections tend to be much lower.

The plots appear below.

list(g1,g2,g3) %>% wrap_plots(nrow=2, ncol=2)

Comparisons of WABS and single-predictor LOGR predictions

21 Appendix C Neural network prediction

We now use a simple neural network to predict the probabilities of accident at each crossing. We use the nnet routine in the caret package. Such a routine uses the data to fire adjacent nodes in the net at random, based on random numbers generated internally. Because we need to seed the neural network with a random number we cannot be assured we will get the same predictions each time. However, after we establish that the neural network works we can then bootstrap its results to get confidence intervals on the predictions. Start by setting the random number generation seed to a specific repeatable number. For comparison we select the same 3 variables we chose above. Since there are only 3 collisions, there could be difficulty training the neural net.

We use a neural network from the nnet package with feature extraction using PCA.

We can display the predicted probabilities with the WABS and LOGR predictions for comparison.

print(mutate( out, NNET=round(probs$`1`,4 ) ) )

##                               name   WABS    LOGR   NNET
## 1             SANTAROSA; HEARN_AVE 0.1711 0.30700 0.1070
## 2           SANTAROSA; STEELE_LANE 0.1293 0.26425 0.0888
## 3      PETALUMA; WASHINGTON_STREET 0.0504 0.24901 0.0660
## 4        SANTAROSA; COLLEGE_AVENUE 0.0494 0.19704 0.0939
## 5            SANTAROSA; PINER_ROAD 0.0485 0.14239 0.1096
## 6   ROHNERTPARK; GOLF_COURSE_DRIVE 0.0464 0.10823 0.0760
## 7            SANRAFAEL; 4TH_STREET 0.0458 0.09945 0.0728
## 8       SANRAFAEL; N_SANPEDRO_ROAD 0.0455 0.09902 0.0711
## 9      SANTAROSA; GUERNEVILLE_ROAD 0.0426 0.09146 0.0774
## 10         PETALUMA; EAST_D_STREET 0.0424 0.08986 0.0648
## 11           SANTAROSA; 3RD_STREET 0.0423 0.08493 0.0613
## 12      SANTAROSA; SAN_MIGUEL_ROAD 0.0412 0.08308 0.0945
## 13           SANTAROSA; BARHAM_AVE 0.0400 0.08175 0.0942
## 14               NOVATO; GRANT_AVE 0.0379 0.08109 0.0733
## 15               NOVATO; OLIVE_AVE 0.0377 0.07851 0.0673
## 16           PETALUMA; CORONA_ROAD 0.0371 0.07651 0.1027
## 17            SANTAROSA; TODD_ROAD 0.0367 0.07650 0.0736
## 18       SANRAFAEL; MISSION_AVENUE 0.0364 0.07406 0.0626
## 19      SANRAFAEL; CIVIC_CENTER_DR 0.0358 0.06865 0.0628
## 20           SANRAFAEL; 5TH_AVENUE 0.0354 0.06430 0.0620
## 21       PETALUMA; W_PAYRAN_STREET 0.0343 0.06325 0.0746
## 22   SANTAROSA; BELLEVUE/CORBY_AVE 0.0332 0.06120 0.0961
## 23      SANTAROSA; SEBASTOPOL_ROAD 0.0325 0.06070 0.0792
## 24       SANRAFAEL; SMITH_RANCH_RD 0.0324 0.06020 0.0734
## 25      PETALUMA; LAKEVILLE_STREET 0.0313 0.05873 0.0698
## 26           SANTAROSA; 9TH_STREET 0.0313 0.05729 0.0687
## 27            NOVATO; ROBLAR_DRIVE 0.0240 0.05547 0.0774
## 28        PETALUMA; NMCDOWELL_BLVD 0.0236 0.02283 0.0771
## 29         NOVATO; RUSHCREEK_PLACE 0.0235 0.02168 0.0683
## 30          SANRAFAEL; IRWINSTREET 0.0231 0.01990 0.0672
## 31        SONOMA; SEARSPOINT/ST-37 0.0212 0.01534 0.0849
## 32       SANRAFAEL; PACHECO_STREET 0.0200 0.01358 0.0671
## 33           SANTAROSA; 8TH_STREET 0.0190 0.01011 0.0587
## 34           SANTAROSA; 7TH_STREET 0.0180 0.01011 0.0586
## 35           SANTAROSA; 6TH_STREET 0.0180 0.00828 0.0586
## 36        PETALUMA; CAULFIELD_LANE 0.0172 0.00765 0.0580
## 37         COTATI; EAST_COTATI_AVE 0.0169 0.00742 0.0778
## 38     ROHNERTPARK; SOUTHWEST_BLVD 0.0163 0.00703 0.0771
## 39  ROHNERTPARK; ROHNERT_PARK_EXPY 0.0159 0.00378 0.0745
## 40              FULTON; RIVER_ROAD 0.0158 0.00369 0.0781
## 41      HEALDSBURG; HEALDSBURG_AVE 0.0149 0.00273 0.0689
## 42           NOVATO; HAMILTON_PKWY 0.0145 0.00234 0.0767
## 43           SANRAFAEL; 2ND_STREET 0.0141 0.00151 0.0753
## 44          NOVATO; HANNA_RANCH_RD 0.0136 0.00144 0.0740
## 45         SONOMA; SR_ROUTE_12/121 0.0129 0.00137 0.0663
## 46        HEALDSBURG; FRONT_STREET 0.0128 0.00131 0.0658
## 47     PENNGROVE; MAIN_ST/WOODWARD 0.0126 0.00114 0.0713
## 48       WINDSOR; WINDSOR_RIVER_RD 0.0126 0.00043 0.0753
## 49       SANRAFAEL; FRANCISCO_BLVD 0.0124 0.00029 0.0606
## 50       PENNGROVE; OLD_ADOBE_ROAD 0.0117 0.00029 0.0710
## 51         SONOMA; SR_ROUTE_12/121 0.0117 0.00029 0.0676
## 52           SANRAFAEL; 3RD_STREET 0.0115 0.00023 0.0715
## 53             FULTON; FULTON_ROAD 0.0114 0.00022 0.0747
## 54       HEALDSBURG; BAILHACHE_AVE 0.0112 0.00017 0.0740
## 55       COTATI; EAST_RAILROAD_AVE 0.0108 0.00017 0.0739
## 56      SANTAROSA; WEST_ROBLES_AVE 0.0099 0.00015 0.0737
## 57            PETALUMA; ELY_ROAD_N 0.0096 0.00013 0.0705
## 58         ROHNERTPARK; SCENIC_AVE 0.0093 0.00009 0.0736
## 59             WINDSOR; STARR_ROAD 0.0091 0.00007 0.0740
## 60             FULTON; SHILOH_ROAD 0.0085 0.00007 0.0739
## 61       HEALDSBURG; LIMERICK_LANE 0.0082 0.00004 0.0734
## 62            FULTON; AIRPORT_BLVD 0.0081 0.00004 0.0739
## 63             SANRAFAEL; PED_XING 0.0081 0.00004 0.0672
## 64         PETALUMA; HOPPER_STREET 0.0078 0.00004 0.0624
## 65               NAPA; MILTON_ROAD 0.0077 0.00003 0.0597
## 66          SANRAFAEL; RICE_STREET 0.0075 0.00003 0.0606
## 67        SONOMA; SKAGGS_ISLAND_RD 0.0068 0.00002 0.0634
## 68          HEALDSBURG; GRANT_AVE. 0.0064 0.00000 0.0732
## 69          WINDSOR; MITCHELL_LANE 0.0060 0.00000 0.0736
## 70       SONOMA; TOULAY_CREEK_ROAD 0.0052 0.00000 0.0654
## 71           FULTON; AVIATION_BLVD 0.0039 0.00000 0.0752
## 72         SANTAROSA; STANDISH_AVE 0.0038 0.00000 0.0597
## 73        SANRAFAEL; PALOMA_AVENUE 0.0016 0.00000 0.0671
## 74   HEALDSBURG; GRANT_SCHOOL_ROAD 0.0003 0.00000 0.0663
## 75      PETALUMA; SOUTH_POINT_ROAD 0.0003 0.00000 0.0663
## 76                NOVATO; PED_XING 0.0002 0.00000 0.0699
## 77    NOVATO; Franklin_Ped_crossin 0.0001 0.00000 0.0739
## 78 SANTAROSA; Railroad_square_ped_ 0.0001 0.00000 0.0605
## 79           NOVATO; BIKE_PATH_PED 0.0001 0.00000 0.0764
## 80           ROHNERTPARK; PED_XING 0.0001 0.00000 0.0569

We compare the neural network predictions with those of LOGR graphically.

mutate( out, NNET=round(probs$`1`,4 ) )  %>% ggplot()+
  geom_point(aes(WABS, NNET, color="NNET")) +
  geom_point(aes(WABS, LOGR, color="LOGR")) +
  scale_color_manual(values=c("blue","orange"), name="Model")+
  geom_abline(aes(intercept=0, slope=1), linetype="dotted") +
  labs(y="Probability, NNET or LOGR")

For zero or low WABS predictions, LOGR is right in line, but above .025 WABS predicts lower; NNET predicts uniformly higher except for the two cases above a WABS prediction of 10.

A neural net with feature extraction seems like it is not as useful as our LOGA model.

WABS Method to estimate prob of collision

Bruce Hartman

February 1, 2019; updated March 31, 2025

1 Introduction

2 Data

3 Organizing Data

4 Data preview

5 Logistic regression analysis

6 Reduced form of logistic regression

7 Predicting probabilities of collision

8 Importance of variables

9 Residuals

10 Comparing with the WABS predictions

11 Predictions using interaction terms

12 Improving on LOGA

13 Extending model with `TOT_TRN`

14 Particular crossings

15 Confidence intervals for predicted probabilities

16 Conclusions

17 Predicting the number of accidents in a year.

18 Acknowledgements

19 References

20 Appendix A Univariate logistic regressions

21 Appendix C Neural network prediction

WABS Method to estimate prob of collision

Bruce Hartman

February 1, 2019; updated March 31, 2025

1 Introduction

2 Data

3 Organizing Data

4 Data preview

5 Logistic regression analysis

6 Reduced form of logistic regression

7 Predicting probabilities of collision

8 Importance of variables

9 Residuals

10 Comparing with the WABS predictions

11 Predictions using interaction terms

12 Improving on LOGA

13 Extending model with TOT_TRN

14 Particular crossings

15 Confidence intervals for predicted probabilities

16 Conclusions

17 Predicting the number of accidents in a year.

18 Acknowledgements

19 References

20 Appendix A Univariate logistic regressions

21 Appendix C Neural network prediction

13 Extending model with `TOT_TRN`