Neural Network Diary #8: Getting back in the saddle

Rather lengthy hiatus, sorry about that. Moving to another country and day job requirements have throttled my betting-related activities down to a bare minimum. I hope to have a little more time to invest in this project in the near future.

Let’s pick up where we left off at the end of the previous post and look at the code I use to instruct FANN to learn. As FANN does most of the heavy lifting, we only need to tell it what data to use and how to handle it.

  # input_qty = number of input nodes
  # output_qty = number of output nodes
  # savefile = name of the file network is saved to
  # neurons = how many neurons will be trained by cascade function

def train_cascade(input_qty, output_qty, savefile, neurons)
  # Create the basis for the network, define number of inputs and outputs
  net = RubyFann::Shortcut.new(:num_inputs=> input_qty, :num_outputs=> output_qty)

  # As our inputs and outputs can have a value of -1 to 1 we can only use sigmoid_symmetric
  # and that needs to be specified
  net.set_cascade_activation_functions([:sigmoid_symmetric])
  net.set_activation_function_output(:sigmoid_symmetric)
  
  # Search for pairs to be used for training
  pairs = Pair.where(:race_id => Run.where(:dataset => "learning").where.not(:draw => nil).pluck(:race_id).uniq).where.not(:input => nil)
  
  # I have saved the precalculated inputs to the database as one field, so it needs splitting
  # before the values are usable
  inputs = []
  pairs.each do |pair|
    inputs << pair.input.split(",").map(&:to_f)
  end

  # And same done to outputs. Output also needs to be an array, even if it is only one value
  outputs = []
  pairs.pluck(:output).each do |o|
    outputs << Array.new(1,o)
  end

  # Once we have arrays of inputs and outputs we can combine them into form that FANN
  # can understand
  train = RubyFann::TrainData.new(:inputs => inputs, :desired_outputs => outputs)
  
  # Finally it is time for some training. This will take a while, depending naturally
  # on how much data one is using.
  net.cascadetrain_on_data(train, neurons, 1, 0.05)

  # After training it is important to remember to save the network into a file
  # for further use
  net.save(savefile)
end

That is pretty simple isn’t it?

Obviously it still needs a single line in some other file to actually call this function. Now we can proceed to training and find out whether there actually is something at the end of this exercise.
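The data-shaping step inside train_cascade can be tried without FANN or the database. Below is a minimal sketch in plain Ruby. `SamplePair` and the sample values are hypothetical stand-ins for rows of the pairs table, not real data.

```ruby
# Hypothetical stand-in for Pair rows: input is a comma-separated string,
# output is a single number (1 or -1).
SamplePair = Struct.new(:input, :output)

sample_pairs = [
  SamplePair.new("0.06,1,0,-0.07", 1),
  SamplePair.new("0.07,0,1,0.12", -1)
]

# Each input string becomes an array of floats, one array per pair.
inputs = sample_pairs.map { |pair| pair.input.split(",").map(&:to_f) }

# Each output is wrapped in a one-element array, since the training data
# expects an array per pair even for a single output node.
outputs = sample_pairs.map { |pair| [pair.output] }

p inputs
p outputs
```

Both arrays have one entry per pair and in the same order, which is the shape handed to RubyFann::TrainData.new in the method above.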

Neural network diary #7: Converting data to usable form

Now that we have the pairs built it is time to look at the inputs that I am going to use for the network. Let’s start with the main method for the dataline and then go through the components in more detail.

require_relative 'schema'
require_relative 'helpers'
require_relative 'netbuilder'
require 'ruby-progressbar'

# Select the pairs that are used for learning 
pairs = Pair.where(:race_id => Run.where(:dataset => "learning").where.not(:draw => 0).pluck(:race_id).uniq)

# Initialize progressbar and required variables
@all = pairs.size
@progressbar = ProgressBar.create
@progress = 0

# Loop through all pairs and check for progress
pairs.each_with_index do |pair, index|
  build_dataline(pair)
  progress?(index)
end

@progressbar.finish

As you can see, I did the progress bar a bit differently this time: I made the variables it requires instance variables so that they can be accessed by other methods as well. Also, I am excluding pairs that include a runner with a draw of zero. This is because there was an error in my data source and draws were missing from races run in the first half of 2013. Unfortunately they were not in the original imports from the data supplier to Racing Dossier either, so I need to find an alternative source for those draws, as I don’t want to finalize my network training until I have all of the data. Other than those two comments the code is pretty self-explanatory: we choose the pairs and loop through them.

def build_dataline(pair)
  # Select the runs and build the dataline
  inside = Run.find(pair.inside_runner)
  outside = Run.find(pair.outside_runner)
  dataline = dataline_normal(inside, outside)
  pair.input = dataline
  pair.output = determine_output(inside.position, outside.position)
  pair.save!
end

Now that I look at the code above I realize that there is an extra line: I don’t need to place the dataline into the variable dataline before setting the pair.input field. Oh well, I guess I can refactor as I go.

def dataline_normal(inside, outside)
  # Declare the array
  dataline = []
  # Convert distance in yards to number less than 1.0, one input
  dataline << inside.distance / 10000.0
  # Check if race is handicap or not, two inputs
  # Handicap = [1,0], non-handicap = [0,1]
  dataline << determine_handicap(inside.handicap)
  # Convert going to thermometer style input, five inputs
  # Slow = [1,0,0,0,0] -> Fast = [1,1,1,1,1]
  dataline << determine_going(inside.going)
  # Check how far away horses are from each other and scale to number less than 1.0, one input
  dataline << diff(inside.draw, outside.draw, 100)
  # Check difference of spdfiglr, one input
  dataline << diff(inside.spdfiglr, outside.spdfiglr, 100)
  # Check difference of shorpro, one input
  dataline << diff(inside.shorpro, outside.shorpro, 100)
  # Check difference of pfp, one input
  dataline << diff(inside.pfp, outside.pfp, 100)
  # Check difference of shoravd, one input
  dataline << diff(inside.shoravd, outside.shoravd, 100)
  # Check difference of raiform, one input
  dataline << diff(inside.raiform, outside.raiform, 1000)
  # Check if runners are moving up or down in money class, four inputs
  # No movement = [0,0,0,0] Inside up, outside down = [1,0,0,1]
  dataline << determine_mcls(inside.mclslr, outside.mclslr)
  # Check difference of acecl, one input
  dataline << diff(inside.acecl, outside.acecl, 1)
  # Check if either runner is a course and/or distance winner or same race winner, eight inputs
  # [c, d, cd, cds] + [c, d, cd, cds]
  dataline << determine_cdwinner(inside.cdwinner, outside.cdwinner)
  # Remove any sub arrays from dataline array
  dataline.flatten.join(",")
end

As many comments as there is code 🙂 Basically I create the inputs one by one, either by calculating the difference between the two ratings as explained in an earlier post or by creating an array of ones and zeros, and append the results to the dataline array. Below you can see the helper methods used in the above piece of code.

def determine_handicap(handicap)
  o = Array.new(2,0)
  # Compare explicitly: any non-nil string (even "false") is truthy in Ruby
  if handicap == true || handicap == "true"
    o[0] = 1
  else
    o[1] = 1
  end
  o
end

def determine_going(going)
  arr = Array.new(5, 0)
  goings = ["Slow","Standard To Slow","Standard","Standard To Fast","Fast"]
  # An unknown or missing going leaves the array as all zeros
  last = goings.index(going)
  return arr if last.nil?
  (0..last).each do |i|
    arr[i] = 1
  end
  arr
end

def diff(inside, outside, divisor)
  o = begin (inside - outside) / divisor.to_f rescue 0 end
  if o > 1
    1
  elsif o < -1
    -1
  else
    o
  end
end

def determine_mcls(inside, outside)
  o = Array.new(4,0)
  unless inside.nil?
    o[0] = 1 if inside > 1.07
    o[1] = 1 if inside < 0.93
  end
  unless outside.nil?
    o[2] = 1 if outside > 1.07
    o[3] = 1 if outside < 0.93
  end
  o
end

def determine_cdwinner(inside, outside)
  o = Array.new(8,0)
  o[inside - 1] = 1 if inside > 0
  o[outside + 3] = 1 if outside > 0
  o
end

def determine_output(inside, outside)
  if inside < outside
    1
  else
    -1
  end
end

def progress?(index)
  # Check if there is progress made
  if @progress < (index.to_f / @all * 100).round
    @progressbar.increment
    @progress += 1
  end
end

This is pretty simple stuff, basically calculations that I prefer not to repeat and/or that are nice to keep out of the main blocks of code. That was it for this time. In the next post we actually start looking at the code used in training the network, or at least the code where I tell FANN to learn 🙂
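As a quick check of the two encodings this post leans on most, here is a minimal standalone sketch of the thermometer-style going input and the scaled, clamped difference. The method names `thermometer` and `scaled_diff` are hypothetical, and returning all zeros for an unknown going is an assumption rather than something covered above.

```ruby
# Thermometer encoding for going: every position up to and including the
# matching going is set to 1.
GOINGS = ["Slow", "Standard To Slow", "Standard", "Standard To Fast", "Fast"].freeze

def thermometer(going)
  idx = GOINGS.index(going)
  # Assumption: an unknown going is encoded as all zeros.
  return Array.new(GOINGS.size, 0) if idx.nil?
  Array.new(GOINGS.size) { |i| i <= idx ? 1 : 0 }
end

# Difference between two ratings, scaled by a divisor and clamped to -1..1.
# A missing rating gives 0, matching the rescue in diff above.
def scaled_diff(inside, outside, divisor)
  return 0 if inside.nil? || outside.nil?
  ((inside - outside) / divisor.to_f).clamp(-1.0, 1.0)
end

p thermometer("Standard")    # three ones followed by two zeros
p scaled_diff(3, 10, 100)    # a small negative value: inside is drawn lower
p scaled_diff(250, 10, 100)  # clamped to the upper bound
p scaled_diff(nil, 10, 100)  # a missing rating falls back to 0
```

The clamp matters mostly for ratings with occasional outliers, where one extreme value would otherwise push an input outside the -1 to 1 range the network expects.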

Neural Network Diary #6: Building the pairs

Now that we have the data split into different datasets it is dueling time. As the plan is to look at each race as a bunch of duels between two horses, we need to do this pairing up. Now that I have the database it makes sense to save this information as well, so there is no need to redo the pairing each time the pairs are needed for teaching the network.

I am not going to pair each horse in a race with all the others, only those that were really challenging for the win against all others. As the criterion for challenging for the win I use finishing within three lengths of the winner. This means that the complete database of 20 thousand runs transforms into a database of 70 thousand pairs, a slightly more respectable dataset for teaching the network.

I started by creating a new table in my database called pairs, which has five fields: race_id to refer to the race in question, the id of the inside runner (the one with the lower draw) and the id of the outside runner. The remaining two fields are reserved for the input and output lines, which I will go through in the next post.

The code for doing the pairing is pretty simple and I have tried to explain it in the comments below. I have used a completely optional gem for the progress bar here to give me some indication of whether anything is happening when running the code.

def create_pairs
  # Initialize the progress bar
  progressbar = ProgressBar.create
  progress = 0
  # Get ids for all races to go through them one at a time
  race_ids = Run.all.pluck(:race_id).uniq
  # Total number of races, needed by progress? below
  all = race_ids.size
  race_ids.each_with_index do |race_id, index|
    # Select the top contenders in a race
    top3 = Run.where(:race_id => race_id, :distance_to_winner => 0..3)
    top3.each do |top|
      # Select all other horses from the race
      opponents = Run.where(:race_id => race_id).where.not(:run_id => top.run_id)
      opponents.each do |opponent|
        # Determine which horse is on inside and which is running outside
        inside = 0
        outside = 0
        if top.draw < opponent.draw
          inside = top.run_id
          outside = opponent.run_id
        else
          inside = opponent.run_id
          outside = top.run_id
        end
        # Create the pair in the database and save it
        pair = Pair.create(:race_id => race_id, :inside_runner => inside, :outside_runner => outside)
        pair.save!
      end
    end
    # Check if progress has moved one percentage point and increment the bar if so
    if progress?(index, progress, all)
      progressbar.increment
      progress += 1
    end
  end
end

def progress?(index, progress, all)
  # Check if there is progress made
  progress < (index.to_f / all * 100).round
end

Running this code will create the pairs and save them to the database for later use. You will notice that I created pairs for all races and not just those that are in the learning dataset. I did this with the idea that I might shuffle the data at some point and I don’t want to have to build the pairs again when that happens.

That is all for this time around; in the next post I am going to build the dataline used as input for one pair.
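The pairing rule itself can be tried on a single made-up race without the database. This is a minimal sketch: `Runner`, `build_pairs` and the three sample runners are hypothetical, and only the rule (contenders within three lengths, lower draw on the inside) comes from the code above.

```ruby
# Hypothetical in-memory stand-in for rows of the runs table.
Runner = Struct.new(:run_id, :draw, :distance_to_winner)

# Pair every contender (within three lengths of the winner) with every
# other runner in the race, putting the lower draw first.
def build_pairs(runners)
  contenders = runners.select { |r| (0..3).cover?(r.distance_to_winner) }
  contenders.flat_map do |top|
    (runners - [top]).map do |opponent|
      if top.draw < opponent.draw
        [top.run_id, opponent.run_id]
      else
        [opponent.run_id, top.run_id]
      end
    end
  end
end

race = [
  Runner.new(1, 4, 0.0), # winner, drawn 4
  Runner.new(2, 1, 1.5), # beaten 1.5 lengths, drawn 1
  Runner.new(3, 7, 6.0)  # beaten 6 lengths, drawn 7
]

pairs = build_pairs(race)
p pairs           # [inside_id, outside_id] for each duel
p pairs.size
p pairs.uniq.size
```

Note that in this sketch two contenders in the same race are paired once from each side, so that duel appears twice; calling uniq drops the repeat if that is not wanted.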

Neural Network Diary #5: Splitting the datasets

As I initially thought, I ended up going the database route for handling the data. As there is a relatively moderate amount of data I decided to use an SQLite database. It is a really convenient, simple database where the data is stored in a single file and it doesn’t require heavy background processes. To make edits to the database I am using Sqlitebrowser, a really handy app for building the tables as well as browsing the data.

Creating the database and uploading the data is just a matter of choosing the save location for the database file and then importing the csv file as a table. It takes only a few minutes (naturally depending on the amount of data).

I am really bad at writing SQL so I tend to use a Ruby gem called Active Record. While it is primarily meant to be used with the web framework Ruby on Rails, it is perfectly usable standalone as well.

To use Active Record I have a couple of files that my main interaction file depends on. First I have a file named dbconfig.rb

require 'rubygems'
require 'active_record'

ActiveRecord::Base.establish_connection(
  :adapter => "sqlite3",
  :database => "/path/to/database.sqlite3"
)

I require that file in schema.rb which is used to define models that I am using. Though there isn’t much there yet.

require_relative 'dbconfig'

class Run < ActiveRecord::Base

end

And finally the file that I used to divide the data into three datasets: three fifths to learning and one fifth each to test and unseen, roughly evenly distributed between all courses and distances as well as time periods.

require_relative 'schema'
race_ids = Run.all.order(course: :asc, distance: :asc, date: :asc).pluck(:race_id).uniq
i = 1 
race_ids.each do |race_id| 
  runners = Run.where(:race_id => race_id) 
  if i < 4 
    runners.update_all(:dataset => "learning") 
    i += 1 
  elsif i < 5 
    runners.update_all(:dataset => "test") 
    i += 1 
  else 
    runners.update_all(:dataset => "unseen") 
    i = 1 # restart the five-race cycle so learning keeps getting three of every five
  end 
end
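The three-one-one rotation can also be checked in isolation with plain Ruby. The sketch below is only an illustration: `dataset_for` and the race ids are hypothetical, and it uses the position in the sorted list modulo five instead of a hand-maintained counter.

```ruby
# The position within a repeating block of five races decides the dataset:
# positions 0-2 go to learning, 3 to test and 4 to unseen.
def dataset_for(index)
  case index % 5
  when 0..2 then "learning"
  when 3 then "test"
  else "unseen"
  end
end

# Hypothetical race ids, assumed to be already sorted by course, distance and date.
race_ids = (101..110).to_a

assignment = race_ids.each_with_index.map { |race_id, i| [race_id, dataset_for(i)] }.to_h

# Count how many races landed in each dataset.
counts = assignment.values.each_with_object(Hash.new(0)) { |name, acc| acc[name] += 1 }
p counts
```

With ten races this gives six for learning and two each for test and unseen, which is the three fifths, one fifth, one fifth split described above.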

This is my first time posting code and I am not really sure how much of it is beneficial and how much more people would like to read. Naturally there is a metric ton of material available on the internet if one wants to learn more about Ruby and/or Active Record. But if you have an opinion, please let me know in the comments.

Neural Network Diary #4: Some more thoughts about inputs and data

Last time I was thinking about how to handle the negative values possible in the speed ratings provided by Racing Dossier. Luckily that is not an issue; it is just a matter of using an activation function that supports values from -1 to 1. Activation functions available in FANN can be seen here and the ones to use in my case are either

FANN_SIGMOID_SYMMETRIC

Symmetric sigmoid activation function, AKA tanh. One of the most used activation functions.

This activation function gives output that is between -1 and 1.

or

FANN_SIGMOID_SYMMETRIC_STEPWISE

Stepwise linear approximation to symmetric sigmoid. Faster than symmetric sigmoid but a bit less precise.

This activation function gives output that is between -1 and 1.

And from those I am going to start with the first one. When thinking about this I also had a new idea on how to handle the presentation of the values. Initially I was planning on using normalised values and two fields, one for each runner. Then I thought about using the actual values and adjusting them to be between 0 and 1 (or -1 and 1). The current idea is to use only one field for each rating, holding the difference between the two runners’ ratings, and one field for the network’s output, where 1 means the inside horse came ahead and -1 means the outside horse came ahead.
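To make the -1 to 1 range concrete, here is a minimal sketch of a symmetric sigmoid in plain Ruby using Math.tanh. The method name and the steepness value of 0.5 are illustrative assumptions, not settings taken from this project.

```ruby
# Symmetric sigmoid, i.e. tanh of the scaled input. The steepness default
# is an illustrative assumption.
def sigmoid_symmetric(x, steepness = 0.5)
  Math.tanh(steepness * x)
end

# Negative sums map towards -1, positive sums towards 1, and 0 stays at 0.
[-10, -1, 0, 1, 10].each do |x|
  puts format("%6.1f -> % .4f", x, sigmoid_symmetric(x))
end
```

Because the function is symmetric around zero, a target of 1 for the inside horse and -1 for the outside horse sits naturally at the two ends of its output range.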

Datawise, I have the dataset for the races that I am going to use in development, from 1st of June 2012 to 31st of May 2015. I excluded maidens and selling or claiming races but have included both handicaps and non-handicaps. As I am concentrating on races run over distances of less than 8 furlongs, I had a total of almost 22 000 runs’ worth of data to use. Next up is dividing them evenly into learning, testing and unseen datasets, so that all courses and all distances are evenly represented in all datasets.

 
