I haven’t posted links to other blogs that much, but this post at Equinometry.com from the other side of the Atlantic is a good one. It is not necessarily groundbreaking stuff, but it acts as a good reminder. Coming from the US, not everything in it is readily applicable, but in my opinion enough is to warrant a read.
It briefly covers some newer aspects in addition to basic form, speed, pace and class, and the author promises to follow up with a Wagering 2.0 article as well.
On a related note, there were some good thoughts about data in a recent article at Geegeez.co.uk.
Last time I was thinking about how to handle the negative values possible in the speed ratings provided by Racing Dossier. Luckily that is not an issue; it is just a matter of using an activation function that supports values from -1 to 1. The activation functions available in FANN can be seen here, and the ones to use in my case are either
- Symmetric sigmoid activation function, AKA tanh. One of the most used activation functions. This activation function gives output that is between -1 and 1.
- Stepwise linear approximation to symmetric sigmoid. Faster than symmetric sigmoid but a bit less precise. This activation function gives output that is between -1 and 1.
And from those I am going to start with the first one. When thinking about this I also had a new idea on how to handle the presentation of the values. Initially I was planning on using normalised values and two fields per rating, one for each runner. Then I thought about using the actual values and adjusting them to be between 0 and 1 (or -1 and 1). The current idea is to use only one field for each rating, holding the difference between the two runners’ ratings, and one field for the network’s output, where 1 means the inside horse came ahead and -1 means the outside horse came ahead.
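A minimal sketch of that pair encoding in Ruby could look like the following. The hash layout, the helper name `pair_row`, the choice of three ratings and all of the numbers are made up for illustration; only the idea of one difference field per rating and a 1 / -1 target comes from the text above.

```ruby
# Ratings used in this sketch (names borrowed from Racing Dossier fields;
# which ones end up in the real network is not decided here).
RATINGS = %i[shorpro spdfiglr shoravd].freeze

# Builds one row for a pair of runners: one input per rating
# (inside minus outside) and a single target of 1.0 or -1.0.
def pair_row(inside, outside)
  inputs = RATINGS.map { |r| (inside[r] - outside[r]).to_f }
  # A lower finishing position means the horse came ahead.
  target = inside[:position] < outside[:position] ? 1.0 : -1.0
  { inputs: inputs, target: target }
end

# Made-up example values.
inside  = { shorpro: 62, spdfiglr: 55, shoravd: 58, position: 3 }
outside = { shorpro: 70, spdfiglr: 49, shoravd: 61, position: 1 }

row = pair_row(inside, outside)
p row[:inputs]  # [-8.0, 6.0, -3.0]
p row[:target]  # -1.0
```

The raw differences would still need to be brought into the -1 to 1 range before being fed to a symmetric sigmoid network.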
Datawise, I have the dataset for the races that I am going to use in development, from 1st of June 2012 to 31st of May 2015. I excluded maidens and selling or claiming races but included both handicaps and non-handicaps. As I am concentrating on races run over distances of less than 8 furlongs, I have a total of almost 22 000 runs worth of data to use. Next up is dividing them evenly into learning, testing and unseen datasets, so that all courses and all distances are evenly represented in all datasets.
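One way to get that even representation is to group the runs by course and distance and deal each group out across the three datasets. This is only a sketch under my own assumptions: the field names `:course` and `:distance`, the helper name `stratified_split`, the fixed seed and the toy data are all hypothetical.

```ruby
# Splits runs into learning, testing and unseen sets so that every
# course/distance group contributes to each set in (near) equal shares.
def stratified_split(runs, seed: 42)
  rng = Random.new(seed)
  sets = { learning: [], testing: [], unseen: [] }
  names = sets.keys

  runs.group_by { |r| [r[:course], r[:distance]] }.each_value do |group|
    # Shuffle within the group, then deal the runs out like cards.
    group.shuffle(random: rng).each_with_index do |run, i|
      sets[names[i % names.size]] << run
    end
  end
  sets
end

# Toy data: 2 courses x 3 distances x 9 runs = 54 runs.
runs = []
%w[Lingfield Kempton].each do |course|
  [5, 6, 7].each do |distance|
    9.times { |i| runs << { course: course, distance: distance, id: i } }
  end
end

sets = stratified_split(runs)
# With this toy data each set ends up with 18 of the 54 runs.
sets.each { |name, rows| puts "#{name}: #{rows.size} runs" }
```

Groups whose size is not a multiple of three give their leftover runs to the learning set first, so very small groups will be slightly uneven. Since the rows will be pairs of runners, it may make more sense to deal out whole races rather than single runs.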
Recently I have been thinking about the inputs that I would use in the neural network and, as mentioned earlier, most will come from the Racing Dossier service. I don’t want to include too many, but then again not too few either. Currently I am planning to include the following list of ratings.
- Shorpro – Projected speed rating in today’s race
- SpdfigLR – Speed rating in last race
- SHorAvD – Average speed rating at today’s race distance
- PFP – Current form class level of horse, this rating starts at 1500
- MClSLr – Money Class Shift From Last Race. Prize money of today’s race divided by prize money of last race. Anything greater than 1.07 is a shift up in class, anything less than 0.93 is a drop in class.
- Raiform – Rating assessing last three races
- Course, Distance or Course/Distance winner
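The Money Class Shift rule from the list above is simple enough to sketch in Ruby. The helper name `money_class_shift`, the symbols it returns and the prize money figures are my own illustration; the 1.07 and 0.93 thresholds are the ones given in the rating description.

```ruby
# Classifies the class shift from the prize money of today's race
# and the prize money of the last race.
def money_class_shift(todays_prize, last_prize)
  ratio = todays_prize.to_f / last_prize
  if ratio > 1.07
    :up
  elsif ratio < 0.93
    :down
  else
    :same
  end
end

p money_class_shift(5000, 4000)  # :up   (ratio 1.25)
p money_class_shift(3000, 4000)  # :down (ratio 0.75)
p money_class_shift(4100, 4000)  # :same (ratio 1.025)
```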
I am still thinking that I might add something measuring how successful a horse has been when it comes to prize money.
Originally I was planning on normalising the ratings, but that was before I came up with that list. Now that I think of it, I might just as well use them as they are and divide by a suitably big number to bring them below one. Money Class Shift and Course/Distance winner I am putting in as boolean values.
The only problem with that is the fact that the speed figures above can be less than zero, so I need to find a way to handle that.
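As a rough sketch of the dividing idea: dividing by a fixed bound keeps the sign, so a negative speed figure simply lands between -1 and 0, which is only usable if the network accepts inputs in that range. The helper name `scale` and the bounds 150 and 2000 are assumptions picked for the example, not values from Racing Dossier.

```ruby
# Divides a rating by an assumed upper bound so the result lies in -1..1.
# The sign is kept, and anything beyond the bound is clamped.
def scale(value, bound)
  (value.to_f / bound).clamp(-1.0, 1.0)
end

p scale(75, 150)     # 0.5
p scale(-30, 150)    # -0.2
p scale(1650, 2000)  # 0.825
p scale(400, 150)    # 1.0 (clamped)
```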
Before we get to actually building the neural network I am going to go through the tools that I am planning to use during the project. This list is obviously subject to change, but this is what I feel at this point I will need to complete the project.
I will need to do a fair bit of data modification, and for that I am using Ruby. Naturally one can use any programming language one wishes, but I am most familiar with Ruby and I like how readable and close to natural language the scripts are. When it is relevant I am going to post the code, or at least snippets of it, in the blog as well. If you are new to Ruby it might be worthwhile to look at this quick start at the official Ruby site or this pretty thorough tutorial at Tutorials Point. In the end though, what is needed is pretty simple, beginner-level stuff: some calculations and loops, mostly.
One could build the neural network software from the ground up, but I am going to rely on an existing library for this purpose. Earlier I have been using AI4R, but as I mentioned in my post about the new version of Raiform, I have moved on to FANN, or Fast Artificial Neural Network. It seems to do a slightly better job even with the same kind of network topology, but what I especially like is a feature called Cascade2. It dynamically builds and trains the topology, and that is what I used to build the network for Raiform 2.0.
Neural networks are a pretty advanced topic, and while it does help if you understand how they work, it is still possible to utilize them even if most of the underlying math is left untouched. FANN has Ruby bindings (in addition to bindings for several other languages) and I am using a Ruby gem called ruby-fann to take advantage of it. FANN has several graphical interfaces as well, but I find it a lot easier to work on the command line (the command line in Windows is a pain to work with, so be warned, or use a proper OS like Linux 🙂 ). If you wish to get a primer on neural networks you could read, for example, this.
The last big building block is data. I am going to use data starting from the beginning of 2012, and all of my data originates from Racing Dossier. I have the data in a database, so it is easy for me to fetch data with the required filters as needed. The actual ratings that I am planning to use I will cover later on. I haven’t decided yet, but it might make sense to build a working database to handle the training and testing data. In the past I have just used CSV files for this purpose.
For a while now I have been planning on combining some ideas that I have used in the past and things that I have wanted to learn more about. And I have decided to write a diary of sorts which would serve a dual purpose of documenting this for my own benefit and potentially acting as a tutorial of sorts for others interested in pursuing similar ends.
My plan is pretty simple. I plan to create a neural network whose output would be further adjusted with a Monte Carlo simulation. The end result of this combination should be the most likely winner and the likelihood of that win, so that in addition to the selection a value price can be calculated for it as well.
I am going to concentrate on 5-7 furlong All Weather races run in the UK and Ireland. The idea is to structure the network in a way that a pair of runners is modeled as one row of data (this is lifted from an old Smartsig article, the reference to which I need to dig up). The winner of a future race is then predicted by comparing all pairs in the race and finding out which runner wins most of these virtual duels.
This is also where I plan to utilize Monte Carlo simulation, so instead of one run through the network I am going to do it ten thousand times, or whatever figure seems reasonable when I get to that point.
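The virtual duels can be sketched in a few lines of Ruby. Here `duel` is a stand-in for the trained network, which does not exist yet: any callable that returns a positive number when the first runner is expected to beat the second. The helper name `predict_winner`, the toy duel and the ratings are all hypothetical.

```ruby
# Counts virtual duel wins for every runner and returns the runner
# with the most wins as a [name, wins] pair.
def predict_winner(runners, duel)
  wins = runners.to_h { |r| [r[:name], 0] }
  # combination(2) yields every pair of runners exactly once.
  runners.combination(2) do |a, b|
    winner = duel.call(a, b) >= 0 ? a : b
    wins[winner[:name]] += 1
  end
  wins.max_by { |_, count| count }
end

# Toy duel: the higher made-up rating wins.
toy_duel = ->(a, b) { a[:rating] - b[:rating] }

runners = [
  { name: "Horse A", rating: 61 },
  { name: "Horse B", rating: 68 },
  { name: "Horse C", rating: 57 }
]

name, wins = predict_winner(runners, toy_duel)
puts "#{name} wins #{wins} of #{runners.size - 1} duels"
# prints "Horse B wins 2 of 2 duels"
```

If two runners tie on duel wins, `max_by` simply returns the first one it finds, so a real version would need a proper tie-break.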
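How the randomness enters is still open, so the following is only one possible reading, sketched under loud assumptions: the network output for a pair (between -1 and 1) is treated as a duel win probability of (output + 1) / 2, each duel is decided by a random draw, and the whole race is repeated many times. The value price is taken as 1 divided by the win share, i.e. fair decimal odds. The helper name `simulate`, the toy duel and the ratings are hypothetical, and at least two runners are expected.

```ruby
# Repeats the pairwise duels many times with random outcomes and returns
# each runner's share of simulated race wins.
def simulate(runners, duel, runs: 10_000, seed: 1)
  rng = Random.new(seed)
  race_wins = Hash.new(0)

  runs.times do
    duel_wins = Hash.new(0)
    runners.combination(2) do |a, b|
      # Assumed mapping from network output (-1..1) to probability (0..1).
      p_a = (duel.call(a, b) + 1) / 2.0
      duel_wins[rng.rand < p_a ? a[:name] : b[:name]] += 1
    end
    best = duel_wins.values.max
    # Ties on duel wins are split by a random draw.
    tied = duel_wins.select { |_, w| w == best }.keys
    race_wins[tied.sample(random: rng)] += 1
  end

  runners.to_h { |r| [r[:name], race_wins[r[:name]] / runs.to_f] }
end

# Toy duel: made-up rating difference squashed into -1..1 with tanh.
toy_duel = ->(a, b) { Math.tanh((a[:rating] - b[:rating]) / 10.0) }

runners = [
  { name: "Horse A", rating: 61 },
  { name: "Horse B", rating: 68 },
  { name: "Horse C", rating: 57 }
]

simulate(runners, toy_duel).each do |name, share|
  price = share.zero? ? "n/a" : format("%.2f", 1 / share)
  puts "#{name}: win share #{format('%.3f', share)}, value price #{price}"
end
```

The fixed seed only makes the run repeatable; the number of runs is a parameter so that ten thousand can be swapped for whatever figure turns out to be reasonable.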
As I am basically learning by doing here I welcome all comments and suggestions any reader might have.