Neural Network Diary #7: Converting data to usable form

Now that we have the pairs built, it is time to look at the inputs that I am going to use for the network. Let’s start with the main method for building the dataline and then go through the components in more detail.

require_relative 'schema'
require_relative 'helpers'
require_relative 'netbuilder'
require 'ruby-progressbar'

# Select the pairs that are used for learning 
pairs = Pair.where(:race_id => Run.where(:dataset => "learning").where.not(:draw => 0).pluck(:race_id).uniq)

# Initialize progressbar and required variables
@all = pairs.size
@progressbar = ProgressBar.create
@progress = 0

# Loop through all pairs and check for progress
pairs.each_with_index do |pair, index|
  build_dataline(pair)
  progress?(index)
end

@progressbar.finish

As you can see, I did the progress bar a bit differently this time: I made the variables it requires instance variables so that they can be accessed by other methods as well. Also, I am excluding pairs that include a runner with a draw of zero. This is because there was an error in my data source and draws were missing from races run in the first half of 2013. Unfortunately they were not in the original imports from the data supplier to Racing Dossier either, so I need to find an alternate source for those draws, as I don’t want to finalize my network training until I have all of the data. Other than those two comments the code is pretty self-explanatory: we choose the pairs and loop through them.
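The once-per-percentage-point check can be tried out without a database. The sketch below is a minimal stand-in, assuming a hypothetical count_ticks helper (not part of my scripts) that replays the loop for a given number of items and counts how often the bar would be incremented. It suggests that the bar keeps pace when there are well over 100 items but lags behind on small sets, which is one reason the final call to finish is useful.

```ruby
# Hypothetical helper: replays the progress check for `total` items and
# returns how many times the bar would have been incremented.
def count_ticks(total)
  progress = 0
  total.times do |index|
    # Same comparison as in progress?: completed percentage vs. ticks so far
    progress += 1 if progress < (index.to_f / total * 100).round
  end
  progress
end

puts count_ticks(250) # one tick per percentage point
puts count_ticks(40)  # fewer than 100 ticks, the bar lags behind
```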

def build_dataline(pair)
  # Select the runs and build the dataline
  inside = Run.find(pair.inside_runner)
  outside = Run.find(pair.outside_runner)
  dataline = dataline_normal(inside, outside)
  pair.input = dataline
  pair.output = determine_output(inside.position, outside.position)
  pair.save!
end

Now that I look at the code above I realize that there is an extra line: I don’t need to first place the dataline into the variable dataline before setting the pair.input field to hold the contents. Oh well, I guess I can refactor as I go.

def dataline_normal(inside, outside)
  # Declare the array
  dataline = []
  # Convert distance in yards to number less than 1.0, one input
  dataline << inside.distance / 10000.0
  # Check if race is handicap or not, two inputs
  # Handicap = [1,0], non-handicap = [0,1]
  dataline << determine_handicap(inside.handicap)
  # Convert going to thermometer style input, five inputs
  # Slow = [1,0,0,0,0] -> Fast = [1,1,1,1,1]
  dataline << determine_going(inside.going)
  # Check how far away horses are from each other and scale to number less than 1.0, one input
  dataline << diff(inside.draw, outside.draw, 100)
  # Check difference of spdfiglr, one input
  dataline << diff(inside.spdfiglr, outside.spdfiglr, 100)
  # Check difference of shorpro, one input
  dataline << diff(inside.shorpro, outside.shorpro, 100)
  # Check difference of pfp, one input
  dataline << diff(inside.pfp, outside.pfp, 100)
  # Check difference of shoravd, one input
  dataline << diff(inside.shoravd, outside.shoravd, 100)
  # Check difference of raiform, one input
  dataline << diff(inside.raiform, outside.raiform, 1000)
  # Check if runners are moving up or down in money class, four inputs
  # No movement = [0,0,0,0] Inside up, outside down = [1,0,0,1]
  dataline << determine_mcls(inside.mclslr, outside.mclslr)
  # Check difference of acecl, one input
  dataline << diff(inside.acecl, outside.acecl, 1)
  # Check if either runner is a course and/or distance winner or same race winner, eight inputs
  # [c, d, cd, cds] + [c, d, cd, cds]
  dataline << determine_cdwinner(inside.cdwinner, outside.cdwinner)
  # Remove any sub arrays from dataline array
  dataline.flatten.join(",")
end

As many comments as there is code 🙂 Basically I create the inputs one by one, either by calculating the difference between the two ratings as explained in an earlier post or by creating an array of ones and zeros, and append the results to the array dataline. Below you can see the helper methods used in the above piece of code.
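Adding up the input counts from the comments in dataline_normal gives the width the input layer of the network has to match. The snippet below is only a sketch of that tally: the constant INPUT_WIDTHS and the helper valid_dataline? are hypothetical names introduced here for illustration.

```ruby
# Width of each input group, taken from the comments in dataline_normal
INPUT_WIDTHS = {
  distance: 1,
  handicap: 2,
  going: 5,
  draw_diff: 1,
  spdfiglr_diff: 1,
  shorpro_diff: 1,
  pfp_diff: 1,
  shoravd_diff: 1,
  raiform_diff: 1,
  mcls: 4,
  acecl_diff: 1,
  cdwinner: 8
}.freeze

# Hypothetical check: does a comma-separated dataline have the expected width?
def valid_dataline?(line, expected = INPUT_WIDTHS.values.sum)
  line.split(",").size == expected
end

puts INPUT_WIDTHS.values.sum # => 27
```

A check like this could be run on pair.input before saving, to catch a helper that returns the wrong number of values.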

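def determine_handicap(handicap)
  # Two inputs: handicap = [1,0], non-handicap = [0,1]
  o = Array.new(2,0)
  # A plain truthiness test would also treat the string "false" as a handicap
  if handicap && handicap != "false"
    o[0] = 1
  else
    o[1] = 1
  end
  o
end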

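def determine_going(going)
  # Thermometer encoding, Slow = [1,0,0,0,0] -> Fast = [1,1,1,1,1]
  arr = Array.new(5, 0)
  goings = ["Slow","Standard To Slow","Standard","Standard To Fast","Fast"]
  index = goings.index(going)
  # An unknown going would otherwise produce a range ending in nil
  raise ArgumentError, "unknown going: #{going.inspect}" if index.nil?
  (0..index).each do |i|
    arr[i] = 1
  end
  arr
end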

def diff(inside, outside, divisor)
  o = begin (inside - outside) / divisor.to_f rescue 0 end
  if o > 1
    1
  elsif o < -1
    -1
  else
    o
  end
end

def determine_mcls(inside, outside)
  # [inside up, inside down, outside up, outside down]
  o = Array.new(4,0)
  unless inside.nil?
    o[0] = 1 if inside > 1.07
    o[1] = 1 if inside < 0.93
  end
  unless outside.nil?
    o[2] = 1 if outside > 1.07
    o[3] = 1 if outside < 0.93
  end
  o
end

def determine_cdwinner(inside, outside)
  # First four positions are for the inside runner, last four for the outside runner
  o = Array.new(8,0)
  o[inside - 1] = 1 if inside > 0
  o[outside + 3] = 1 if outside > 0
  o
end

def determine_output(inside, outside)
  if inside < outside
    1
  else
    -1
  end
end

def progress?(index)
  # Check if there is progress made
  if @progress < (index.to_f / @all * 100).round
    @progressbar.increment
    @progress += 1
  end
end

These are pretty simple, basically some calculations that I prefer not to repeat and/or that are nice to keep out of the main blocks of code. That was it for this time. In the next post we actually start looking at the code used in training the network, or at least the code where I tell FANN to learn 🙂
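To see what the two most used encodings produce, here is a small standalone sketch. It restates the thermometer encoding and the scaled difference in compact form so that it runs on its own. The names thermometer and clamped_diff are hypothetical, and treating a missing rating as 0.0 is my reading of the rescue in diff rather than a separate rule.

```ruby
GOINGS = ["Slow", "Standard To Slow", "Standard", "Standard To Fast", "Fast"].freeze

# Thermometer encoding: every position up to and including the going is set.
def thermometer(going)
  index = GOINGS.index(going)
  raise ArgumentError, "unknown going: #{going.inspect}" if index.nil?
  Array.new(GOINGS.size) { |i| i <= index ? 1 : 0 }
end

# Scaled difference limited to the range -1..1; missing ratings count as 0.0.
def clamped_diff(inside, outside, divisor)
  return 0.0 if inside.nil? || outside.nil?
  ((inside - outside) / divisor.to_f).clamp(-1.0, 1.0)
end

p thermometer("Standard")    # => [1, 1, 1, 0, 0]
p clamped_diff(3, 7, 100)    # => -0.04
p clamped_diff(250, 20, 100) # => 1.0
p clamped_diff(nil, 20, 100) # => 0.0
```

The sign of the difference carries the direction: a negative value means the inside runner has the lower figure.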

Neural Network Diary #6: Building the pairs

Now that we have the data split into different datasets, it is dueling time. As the plan is to look at each race as a bunch of duels between two horses, we need to do this pairing up. Now that I have the database, it makes sense to save this information as well, so there is no need to redo the pairing each time the pairs are needed for teaching the network.

I am not going to pair each horse in a race with all others, but only those that were really challenging for the win against all others. As the criterion for challenging for the win I use finishing within three lengths of the winner. This means that the complete database of 20 thousand runs transforms into a database of 70 thousand pairs, a slightly more respectable dataset for teaching the network.

I started by creating a new table in my database called pairs, which has five fields: race_id to refer to the race in question, then the id of the inside runner (the one with the lower draw) and the id of the outside runner. The remaining two fields are reserved for the input and output lines, which I will go through in the next post.

The code for doing the pairing is pretty simple and I have tried to explain it in the comments below. I have used a completely optional gem for the progress bar here to give me some indication of whether anything is happening while the code runs.

def create_pairs
  # Initialize the progress bar
  progressbar = ProgressBar.create
  progress = 0
  # Get ids for all races to go through them one at a time
  race_ids = Run.all.pluck(:race_id).uniq
  # Total number of races, needed for the progress calculation
  all = race_ids.size
  race_ids.each_with_index do |race_id, index|
    # Select the top contenders in a race
    top3 = Run.where(:race_id => race_id, :distance_to_winner => 0..3)
    top3.each do |top|
      # Select all other horses from the race
      opponents = Run.where(:race_id => race_id).where.not(:run_id => top.run_id)
      opponents.each do |opponent|
        # Determine which horse is on inside and which is running outside
        inside = 0
        outside = 0
        if top.draw < opponent.draw
          inside = top.run_id
          outside = opponent.run_id
        else
          inside = opponent.run_id
          outside = top.run_id
        end
        # Create the pair in the database; create! saves it and raises if that fails
        Pair.create!(:race_id => race_id, :inside_runner => inside, :outside_runner => outside)
      end
    end
    # Check if progress has moved one percentage point and increment the bar if so
    if progress?(index, progress, all)
      progressbar.increment
      progress += 1
    end
  end
end

def progress?(index, progress, all)
  # Check if there is progress made
  progress < (index.to_f / all * 100).round
end

Running this code will create the pairs and save them to the database for later use. You will notice that I created pairs for all races and not just those that are put into the learning dataset. I did this with the idea that I might shuffle the data at some point, and I don’t want to have to build the pairs again when that happens.
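The pairing rule itself can be tried without a database. The sketch below is an in-memory approximation under a few assumptions: Runner is a hypothetical Struct standing in for a row of the runs table, pairs_for_race is a name made up for this example, and draws within a race are taken to be distinct. It also shows that two contenders meet each other twice under this rule, which appears to be true of the database version as well.

```ruby
# Hypothetical in-memory stand-in for a row in the runs table
Runner = Struct.new(:run_id, :draw, :distance_to_winner)

# Pair every contender (within three lengths of the winner) with every
# other runner in the race; the lower draw always goes on the inside.
def pairs_for_race(runners)
  contenders = runners.select { |r| r.distance_to_winner <= 3 }
  contenders.flat_map do |top|
    (runners - [top]).map do |opponent|
      inside, outside = [top, opponent].sort_by(&:draw)
      [inside.run_id, outside.run_id]
    end
  end
end

# Made-up race: run_id, draw, distance to winner in lengths
race = [
  Runner.new(1, 4, 0.0), # winner
  Runner.new(2, 1, 1.5), # within three lengths
  Runner.new(3, 7, 6.0),
  Runner.new(4, 2, 9.5)
]

pairs = pairs_for_race(race)
p pairs.size      # => 6
p pairs.uniq.size # => 5
```

If the repeated pair turns out to be unwanted, calling uniq on the result (or a uniqueness check before saving) would be one way to drop it.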

That is all for this time around; in the next post I am going to build the dataline used as input for one pair.

Neural Network Diary #4: Some more thoughts about inputs and data

Last time I was thinking about how to handle the negative values possible in the speed ratings provided by Racing Dossier. Luckily that is not an issue: it is just a matter of using an activation function that supports values from -1 to 1. Activation functions available in FANN can be seen here and the ones to use in my case are either

FANN_SIGMOID_SYMMETRIC

Symmetric sigmoid activation function, AKA tanh. One of the most used activation functions.

This activation function gives output that is between -1 and 1.

or

FANN_SIGMOID_SYMMETRIC_STEPWISE

Stepwise linear approximation to symmetric sigmoid. Faster than symmetric sigmoid but a bit less precise.

This activation function gives output that is between -1 and 1.

And from those I am going to start with the first one. When thinking about this I also had a new idea on how to handle the presentation of the values. Initially I was planning on using normalised values and two fields, one for each runner. Then I thought about using the actual values and adjusting them to be between 0 and 1 (or -1 and 1). The current idea is to use only one field for each rating, holding the difference between the two runners’ ratings, and one field for the network’s output, where 1 means the inside horse came ahead and -1 means the outside horse came ahead.
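To get a feel for why a symmetric function suits inputs and targets that can be negative, here is a small sketch using plain tanh from Ruby’s Math module. It does not call FANN at all; symmetric_sigmoid and target are hypothetical names, and any steepness factor the library applies on top of tanh is ignored here.

```ruby
# Plain tanh, the shape behind a symmetric sigmoid; any steepness factor
# applied by the library is ignored in this sketch.
def symmetric_sigmoid(x)
  Math.tanh(x)
end

# Target value as described above: 1 if the inside horse finished ahead,
# -1 otherwise. Arguments are finishing positions, lower is better.
def target(inside_position, outside_position)
  inside_position < outside_position ? 1 : -1
end

[-5.0, -0.5, 0.0, 0.5, 5.0].each do |x|
  puts format("%5.1f -> %.3f", x, symmetric_sigmoid(x))
end

puts target(2, 5) # => 1
puts target(4, 1) # => -1
```

Negative inputs map to negative outputs and the result never leaves the range -1 to 1, so both a negative rating difference and a target of -1 fit naturally.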

Datawise, I have the dataset for the races that I am going to use in development, from 1st of June 2012 to 31st of May 2015. I did exclude maidens and selling or claiming races but have included both handicaps and non-handicaps. As I am concentrating on races run over distances of less than 8 furlongs, I had a total of almost 22 000 runs worth of data to use. Next up is dividing them evenly into learning, testing and unseen datasets, so that all courses and all distances are evenly represented in all datasets.
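One way such an even split could be done is to group the records by course and distance and deal each group out in turn to the three datasets. The sketch below is only an illustration of that idea, not the code I ended up using: split_evenly, the hash keys and the sample records are all made up, and in practice one would probably deal out whole races rather than single records so that pairs stay together.

```ruby
DATASETS = %w[learning testing unseen].freeze

# Assign a dataset to each record so that every (course, distance) group is
# spread as evenly as possible over the three datasets.
def split_evenly(records, seed: 42)
  rng = Random.new(seed)
  assignment = {}
  records.group_by { |r| [r[:course], r[:distance]] }.each_value do |group|
    # Shuffle within the group, then deal out round-robin
    group.shuffle(random: rng).each_with_index do |record, i|
      assignment[record[:id]] = DATASETS[i % DATASETS.size]
    end
  end
  assignment
end

# Made-up sample: two courses, six records each, same distance in yards
records = (1..12).map do |id|
  { id: id, course: id.odd? ? "A" : "B", distance: 1200 }
end

counts = split_evenly(records).values.each_with_object(Hash.new(0)) { |d, h| h[d] += 1 }
p counts
```

With six records per group, each dataset receives two from each course, so the three datasets end up the same size.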

 

Neural Network Diary #3: Thoughts about inputs and ratings

Recently I have been thinking about the inputs that I would use in the neural network and, as mentioned earlier, most will come from the Racing Dossier service. I don’t want to include too many, but then again not too few either. Currently I am planning to include the following list of ratings.

  • Shorpro – Projected speed rating in today’s race
  • SpdfigLR – Speed rating in last race
  • SHorAvD – Average speed rating at today’s race distance
  • PFP – Current form class level of horse, this rating starts at 1500
  • MClSLr – Money Class Shift From Last Race. Prize money of today’s race divided by prize money of last race. Anything greater than 1.07 is a shift up in class, anything less than 0.93 is a drop in class.
  • Raiform – Rating assessing last three races
  • Course, Distance or Course/Distance winner

I am still thinking that I might add something measuring how successful the horse has been when it comes to prize money.

Originally I was planning on normalising the ratings, but that was before I came up with that list. Now that I think of it, I might just as well use them as they are and divide them by a suitably big number to bring them to less than one. Money Class shift and Course/Distance winner I am putting in as boolean values.

The only problem with that is that the speed figures above can be less than zero, so I need to find a way to handle that.
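The money class shift thresholds from the list above can be turned into boolean inputs along these lines. This is a minimal sketch with made-up prize money figures; money_class_shift and class_flags are hypothetical names introduced for the example.

```ruby
# Money class shift: prize money of today's race divided by that of the last race
def money_class_shift(todays_prize, last_prize)
  todays_prize.to_f / last_prize
end

# Two boolean-style inputs per runner: [moving up, moving down]
def class_flags(shift)
  [shift > 1.07 ? 1 : 0, shift < 0.93 ? 1 : 0]
end

p class_flags(money_class_shift(5000, 4000)) # => [1, 0]  (1.25, up in class)
p class_flags(money_class_shift(4000, 4000)) # => [0, 0]  (1.0, no shift)
p class_flags(money_class_shift(3000, 4000)) # => [0, 1]  (0.75, down in class)
```

Both flags stay at zero for anything between 0.93 and 1.07, so small changes in prize money do not count as a class move.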

Neural Network Diary #2: Tools & Data

Before we actually get to building the neural network, I am going to go through the tools that I am planning to use during the project. This list is obviously subject to change, but this is what I feel at this point I will need to complete the project.

I will need to do a fair bit of data modification and for that I am using Ruby. Naturally one can use any programming language they wish, but I am most familiar with Ruby and I like how readable and close to natural language the scripts are. When it is relevant I am going to post the code, or at least snippets of it, in the blog as well. If you are new to Ruby it might be worthwhile to look at this quick start at the official Ruby site or this pretty thorough tutorial at Tutorials Point. In the end though, what is needed is pretty simple, beginner-level stuff: some calculations and loops mostly.

One could build the neural network software from the ground up, but I am going to rely on an existing library for this purpose. Earlier I have been using AI4R, but as I mentioned in my post about the new version of Raiform, I have moved on to FANN, or Fast Artificial Neural Network. It seems to be doing a bit better job even with the same kind of network topology, but what I especially like is a feature called Cascade2. It dynamically builds and trains the topology, and that is what I used to build the network for Raiform 2.0.

Neural networks are a pretty advanced topic, and while it does help if you understand how they work, it is still possible to utilize them even if most of the underlying math is left untouched. FANN has Ruby bindings (in addition to several other languages) and I am using a Ruby gem called ruby-fann to take advantage of it. FANN has several graphical interfaces as well, but I find it a lot easier to work on the command line (the command line in Windows is a pain to work with, so be warned or use a proper OS like Linux 🙂 ). If you wish to get a primer on neural networks you could read for example this.

The last big building block is data. I am going to use data starting from the beginning of 2012, and all of my data originates from Racing Dossier. I have the data in a database, so it is easy for me to fetch data with the required filters as needed. The actual ratings that I am planning to use I will cover later on. I haven’t decided yet, but it might make sense to build a working database to handle the training and testing data. In the past I have just used CSV files for this purpose.

 
