Introduction

Recently I have come across something that reminded me of a challenge

I tried some time ago, ITAs bitvector genealogy

challenge(1). Back in the day when I was first

learning scheme, I have tried to use this to build up my skills with

the help of a great tutorial(3).

Needless to say that I failed at that, both for my lack of scheme and

common lisp competence but also for my lack of aptitude as a

programmer.

The something was a thread in Hacker News where ITA was mentioned. I

became curious to see how much I had progressed since my last attempt,

so I grabbed DrRacket, Emacs and jumped into the fray one more time.

Again, most of the original hard work was done by Patrick May. What I

am showing here is a take on his work, translated to scheme

and modified to use different data structures and different points of

views on the same fundamental algorithm.

Ok, this is the description of the problem:

The BitVectors are an ancient and immortal race of 10,000, each with

a 10,000 bit genome. The race evolved from a single individual by

the following process: 9,999 times a BitVector chosen at random from

amongst the population was cloned using an error-prone process that

replicates each bit of the genome with 80% fidelity (any particular

bit is flipped from parent to child 20% of the time, independent of

other bits).

Write a program to guess the reproductive history of BitVectors from

their genetic material. The randomly-ordered file

bitvectors-genes.data.gz contains a 10,000 bit line for each

individual. Your program's output should be, for each input line,

the 0-based line number of that individual's parent, or -1 if it is

the progenitor. Balance performance against probability of mistakes

as you see fit.

(1)

The next step is really the crux of the problem. This is where a bit

of mathematics comes into play

The next step in developing a solution is to determine whether or not

a parent-child relationship exists between two bit vectors. According

to the problem definition, every bit vector other than the original is

created by randomly mutating the elements of an existing bit

vector. The probability of a particular bit changing is 20%. That

means that, on average, the children of a bit vector should be 80%

identical to their parent and the grandchildren should be 64%

similar. Of course, the random element means that some will have more

than 80% similarity and others less. Some children may be more

similar to each other than either is to their parent. The best result

any automated solution will be able to achieve is the most probable

relationship tree.

Assuming that a fair random number generator was used, the set

containing the number of bits different between a parent and each of

its children should follow a binomial distribution. The probability of

two vectors having a parent child relationship is related to how close

the difference between the two is to the expected value (20%, in this

case). In a binomial distribution (which approximates a continuous

normal distribution), 99.73% of the bit difference values will be

within three standard deviations of the expected value. The standard

deviation is computed by:

Body

First, I will define some globals that will be useful throughout the

source code, and the utilities to load the bitvectors data into a

vector of numbers:

(define ROOT-SYMBOL -1) (define MUTATION-RATE 0.2) ;; http://mathworld.wolfram.com/BinomialDistribution.html (define DEVIATION 3) ;; 3 = 99.73 percent (define DNA-LENGTH 0) (define (get-dna) DNA-LENGTH) (define (set-dna! number) (set! DNA-LENGTH number)) ;; filename->bitvector : input-port -> (vectorof binary-number) ;; read everything from a given file and returns a ;; vector containing binary numbers. (define (filename->bitvector file-path (radix 2)) (local [(define (read-accumulative bitvector) (let ([line-read (read-line file-path 'any)]) (if (eof-object? line-read) bitvector (let ([possible-number (string->number line-read radix)]) (read-accumulative (if (number? possible-number) (vector-append bitvector (vector possible-number)) bitvector))))))] (cond [(equal? "" file-path)] [else (read-accumulative (vector))])))

The next part is really the crux of the problem. This is where a bit

of mathematics comes into play (2)

and the most complex part of the problem in my view:

The next step in developing a solution is to determine whether or not

a parent-child relationship exists between two bit vectors. According

to the problem definition, every bit vector other than the original is

created by randomly mutating the elements of an existing bit

vector. The probability of a particular bit changing is 20%. That

means that, on average, the children of a bit vector should be 80%

identical to their parent and the grandchildren should be 64%

similar. Of course, the random element means that some will have more

than 80% similarity and others less. Some children may be more

similar to each other than either is to their parent. The best result

any automated solution will be able to achieve is the most probable

relationship tree.

Assuming that a fair random number generator was used, the set

containing the number of bits different between a parent and each of

its children should follow a binomial distribution. The probability of

two vectors having a parent child relationship is related to how close

the difference between the two is to the expected value (20%, in this

case). In a binomial distribution (which approximates a continuous

normal distribution), 99.73% of the bit difference values will be

within three standard deviations of the expected value.

(3)

Taking that, we can establish the existence of a relationship (with

some degree of confidence) with the following code:

;; binary-difference : binary-number binary-number -> integer ;; returns the number of bits that are different between 2 binary numbers. (define (binary-difference bits1 bits2) (let ([differences 0]) (for ((i (in-string (number->string (bitwise-xor bits1 bits2) 2)))) (when (equal? i #\1) (set! differences (add1 differences)))) differences)) ;; binomial-standard-deviation : number -> number ;; Compute the expected binomial standard deviation for N events of probability P. (define (binomial-standard-deviation n p) (sqrt (* n p (- 1 p)))) ;; has-kinship? : integer integer -> boolean ;; verify if 2 individuals share a kinship. ;; The relationship is assumed if the bit difference between the two is ;; less than DEVIATIONS standard deviations from the expected average for ;; the MUTATION-RATE. (define (has-kinship? individual-1 individual-2 (mutation-rate MUTATION-RATE) (deviations DEVIATION) (dna-length (get-dna))) (let* ([expected-difference (* dna-length mutation-rate)] [standard-deviation (binomial-standard-deviation dna-length mutation-rate)] [max-difference (+ expected-difference (* deviations standard-deviation))]) (< (binary-difference individual-1 individual-2) max-difference)))

The code above shows in some ways where I deviated from Patrick's

example. I used a number instead of a bit array (which does not exist in

Racket, the language I'm using, as far as I know). In the following

code, this difference deepens, with the construction of a hash map

instead of a vector to get the relationship matrix:

;; create-relationship-hash : (vectorof number) -> hash ;; creates a hash mapping an individual to it's list of relationships. ;; Both parents and children are mapped to an individual. (define (make-relationship-hash bitvector) (let ([relationship-hash (make-hash)] [dna-range (in-range 0 (vector-length bitvector))]) (for* ((individual-1-index dna-range) (individual-2-index dna-range)) (begin (unless (hash-has-key? relationship-hash individual-1-index) (hash-set! relationship-hash individual-1-index empty)) (let ([individual-1 (vector-ref bitvector individual-1-index)] [individual-2 (vector-ref bitvector individual-2-index)]) (when (and (has-kinship? individual-1 individual-2) (not (= individual-1 individual-2))) (hash-update! relationship-hash individual-1-index (λ (relationships) (cons individual-2-index relationships))))))) relationship-hash))

This almost closes the issue, but I must figure out the directions of

the relationships(who is who's father and son). To establish that I

should have to:

- Find all the bit vectors with only a single parent-child relationship

- Assume that those vectors are the child nodes

- Record that information

- Remove the child node from the parent's list of relationships

- Repeat

(3)

That is accomplished in the longest and most complex code in the solution:

;; make-kinship-hash : hash -> hash ;; creates a hash mapping an individual only to it's parent. (define (make-kinship-hash relationship-hash-initial) (local [(define (make-kinships kinship-hash relationship-hash) (local [(define (process-possible-kinship individual-index relationships-indexes) (let ([number-of-relationships (length relationships-indexes)]) (unless (> number-of-relationships 1) (begin (hash-remove! relationship-hash individual-index) (if (= number-of-relationships 1) ; leaf node (begin (hash-set! kinship-hash individual-index (first relationships-indexes)) (hash-update! relationship-hash (first relationships-indexes) (λ (relationships) (remove individual-index relationships)))) ; root node (begin (hash-remove! relationship-hash individual-index) (hash-set! kinship-hash individual-index ROOT-SYMBOL)))))))] (if (false? (hash-iterate-first relationship-hash)) kinship-hash (begin (hash-for-each relationship-hash process-possible-kinship) (make-kinships kinship-hash relationship-hash)))))] (make-kinships (make-hash) relationship-hash-initial)))

And that is the end of the solution! I am still facing the same issues

as Patrick, specially in regard to the root individual, but

reimplementing his solution and trying to achieve a better result by

the use of different data structures led me to understand the problem

more deeply, and it was loads of fun!

PS: There is more code in

this

repository, specially regarding testing and comparison.

## Bibliography

- 1

ITA.

Bitvector genealogy.

URL

`http://www.itasoftware.com/careers/work-at-ita/hiring-puzzles.html`.

- 2

MathWorld.

Binomial distribution.

URL`http://mathworld.wolfram.com/BinomialDistribution.html`.

- 3

Patrick May.

Bitvector genealogy.

URL`http://www.softwarematters.org/bitvector-genealogy.html`.

## Nenhum comentário:

## Postar um comentário