Neural network diary #7: Converting data to usable form

Now that we have the pairs built it is time to look at the inputs that I am going to use for the network. Let’s start with the main method for dataline and then go through the components in more details.

require_relative 'schema'
require_relative 'helpers'
require_relative 'netbuilder'
require 'ruby-progressbar'

# Select the pairs that are used for learning 
pairs = Pair.where(:race_id => Run.where(:dataset => "learning").where.not(:draw => 0).pluck(:race_id).uniq)

# Initialize progressbar and required variables
@all = pairs.size
@progressbar = ProgressBar.create
@progress = 0

# Loop through all pairs and check for progress
pairs.each_with_index do |pair, index|
  build_dataline(pair)
  progress?(index)
end

@progressbar.finish

As you can see I did the progressbar a bit differently this time, I made variables required by it to be instance variables so that they can be accessed by other methods as well. Also, I am excluding pairs that include a runner that has a draw of zero. This is because there was an error in my data source and draws were missing from races ran in the first half of 2013. Unfortunately they were not in the original imports from data supplier to Racing Dossier either so I need to find an alternate source for those draws as I don’t want to finalize my network training until I have all of the data. Other than those two comments the code is pretty self-explanatory. We choose the pairs and loop through them.

def build_dataline(pair)
  # Select the runs and build the dataline
  inside = Run.find(pair.inside_runner)
  outside = Run.find(pair.outside_runner)
  dataline = dataline_normal(inside, outside)
  pair.input = dataline
  pair.output = determine_output(inside.position, outside.position)
  pair.save!
end

Now that I look at the code above I realize that there is an extra line, I don’t need to first place the dataline into variable dataline before setting pair.input field to hold the contents. Oh well, I guess I can refactor as I go.

def dataline_normal(inside, outside)
  # Declare the array
  dataline = []
  # Convert distance in yards to number less than 1.0, one input
  dataline << inside.distance / 10000.0
  # Check if race is handicap or not, two inputs
  # Handicap = [1,0], non-handicap = [0,1]
  dataline << determine_handicap(inside.handicap)
  # Convert going to thermometer style input, five inputs
  # Slow = [1,0,0,0,0] -> Fast = [1,1,1,1,1]
  dataline << determine_going(inside.going)
  # Check how far away horses are from each other and scale to number less than 1.0, one input
  dataline << diff(inside.draw, outside.draw, 100)
  # Check difference of spdfiglr, one input
  dataline << diff(inside.spdfiglr, outside.spdfiglr, 100)
  # Check difference of shorpro, one input
  dataline << diff(inside.shorpro, outside.shorpro, 100)
  # Check difference of pfp, one input
  dataline << diff(inside.pfp, outside.pfp, 100)
  # Check difference of shoravd, one input
  dataline << diff(inside.shoravd, outside.shoravd, 100)
  # Check difference of raiform, one input
  dataline << diff(inside.raiform, outside.raiform, 1000)
  # Check if runners are moving up or down in money class, four inputs
  # No movement = [0,0,0,0] Inside up, outside down = [1,0,0,1]
  dataline << determine_mcls(inside.mclslr, outside.mclslr)
  # Check difference of acecl, one input
  dataline << diff(inside.acecl, outside.acecl, 1)
  # Check if either runner is a course and/or distance winner or same race winner, eight inputs
  # [c, d, cd, cds] + [c, d, cd, cds]
  dataline << determine_cdwinner(inside.cdwinner, outside.cdwinner)
  # Remove any sub arrays from dataline array
  dataline.flatten.join(",")
end

As much comments as there is code 🙂 Basically I create the inputs one by one by either calculating the difference between the two ratings as explained in an earlier post or creating an array of ones and zeros and appending the results into array dataline. Below you can see the helper methods used in the above piece of code.

def determine_handicap(handicap)
  o = Array.new(2,0)
  if handicap || handicap == "true"
    o[0] = 1
  else
    o[1] = 1
  end
  o
end

def determine_going(going)
  arr = Array.new(5, 0)
  goings = ["Slow","Standard To Slow","Standard","Standard To Fast","Fast"]
  (0..goings.index(going)).each do |i|
    arr[i] = 1
  end
  arr
end

def diff(inside, outside, divisor)
  o = begin (inside - outside) / divisor.to_f rescue 0 end
  if o > 1
    1
  elsif o < -1
    -1
  else
    o
  end
end

def determine_mcls(inside, outside)
  o = Array.new(4,0)
  unless inside.nil?
    inside > 1.07 ? o[0] = 1 : 0
    inside < 0.93 ? o[1] = 1 : 0
  end
  unless outside.nil?
    outside > 1.07 ? o[2] = 1 : 0
    outside < 0.93 ? o[3] = 1 : 0
  end
  o
end

def determine_cdwinner(inside, outside)
  o = Array.new(8,0)
  inside > 0 ? o[inside - 1] = 1 : 0
  outside > 0 ? o[outside + 3] = 1 : 0
  o
end

def determine_output(inside, outside)
  if inside < outside
    1
  else
    -1
  end
end

def progress?(index)
  # Check if there is progress made
  if @progress < (index.to_f / @all * 100).round
    @progressbar.increment
    @progress += 1
  end
end

These are pretty simple stuff, basically some calculations that I prefer not to repeat and / or is nice to keep out from the main blocks of code. That was it for this time. In the next post we actually start looking at the code used in training the network or at least the code were I tell FANN to learn 🙂