_____ ___ ___ ___ _ _ _____ _____ / ___/ / |/ | / | | | | | / ___| | _ \ | |___ / /| /| | / /| | | | | | | | | |_| | \___ \ / / |__/ | | / / | | | | | | | | _ | ___/ ___| | / / | | / / | | | |___ | |___ | |_| | | | /_____/ /_/ |_| /_/ |_| |_____| |_____| \_____/ |_| 1.0 a compact implementation of Symbolic Regression ---------------------------------------------------------------- Author: Hannes Planatscher (info@potschi.de) Department of Computer Science (WSI) University Tuebingen (Germany) Content: I) Introduction II) Installation/Compilation III) Usage IV) Implementation Details 1) Representation of individuals 2) Crossover 3) Mutation 4) Strategy 5) Initialization V) References VI) License I) Introduction SmallGP is a implementation of Symbolic Regression [1] wich is compact by means of lines of code and memory usage, developed as an entry for the "Tiny GP"-Coding-Contest [2] at the Genetic and Evolutionary Computation Conference 2004. II) Installation Instructions tar, gcc and make are needed to compile SmallGP. It has succesfully been compiled under Linux, FreeBSD, Solaris and Win32 (with djgpp). Unpack smallgp-1.0.tar.gz: > tar xvfz smallgp-1.0.tar.gz Move to the created directory: > cd smallgp-1.0 Run Make: > make Some compiler warnings may occur, but after compilation you should find a working version of smallgp in the same directory. III) Usage SmallGP is a commandline-tool. All important parameters can be changed easily. Display the help with > smallgp -h to learn which parameters are avaiable. Most of the parameters and their default values are explained in [2]. Example parameters for a test run: dataset: test.data population size: 10000 tournament size: 100 maximum generations: 80 maximum prog. length: 50 maximum depth: 7 #constants: 200 Pcross: 0.7 Pmut/Node: 0.1 To start smallgp with these parameters just run > smallgp -i test.data -p 10000 -t 100 -g 80 -l 50 -d 7 -r 200 -c 0.7 -m 0.1 . Warning: Parameters are not checked for validity (ex. negative probabilities etc.). IV) Implementation Details 1) Representation of individuals SmallGP uses the concept of genotype-phenotype-duality. Consider the following tree: (+) / \ x1 (+) / \ x2 c2 This can be written in polish notation [3]: + x1 + x2 c2 Due limited amount of different symbols it is possible to store this expression in a vector of integer values. (+) / \ x1 (+) => + x1 + x2 c2 => (2,5,2,6,16) / \ x2 c2 The integer vector represents the genotype, while the phenotype is a standard binary-tree, that consists of nodes containing information which are holding pointers to their children and their predecessor. struct node { struct node *right; struct node *left; struct node **bp; unsigned char no; }; node->no specifies the primitive function/terminal represented by this node. It can easily be deducted that the maximum number of different primitves and terminals is limited to 255 with using an 8-bit datatype. This phenotypic representation enables easy implementation of crossover and evaluation. With 104 bits memory needed per node it isnt really memory efficient, as every node contains just 8 bit of information, so we use the genotype for storage. The phenotype is easily converted to the genotype: a vector of 8-bit values. The genotype is far more memory efficient than the phenotypic representation, but the implementation of subtree crossover and evaluation is less trivial. SmallGP uses the phenotype-genotype-duality to save memory and to allow easy manipulation. While for evaluation and crossover individuals are temporarily expanded to their phenotype, point mutation can be performed on the genotype. 2) Crossover SmallGP implements a size-safe kozastyle subtree-crossover[4]. Subtrees of appropriate size are selected from two individuals, and then exchanged. It is guaranteed that the size constraint is met by both individuals after the operation. Suppose that size(tree1) < size(tree2). tree1 can be extended by (MAXLEN - size(tree1)) nodes. The maximum size of a subtree of tree2 is size(tree2) so we fix a maximum size for the subtree selected from tree2 by a positive random integer value smaller than ((MAX_LEN - size(tree1)) modulo size(tree2)) = maxsize2 After the selection of a subtree subtree_2 with a maximal size maxsize2 from tree2, we can determine maxsize1. tree2 can be extended by (MAXLEN - size(tree2) + size(subtree2)). As the maximum size of a subtree from tree1 is size(tree1) we choose a positive random integer value not bigger than ((MAX_LEN - size(tree2) + size(subtree2)) modulo size(tree1)) = maxsize1 . 3) Mutation SmallGP implements point-mutation, that simply substitutes a terminal with another terminal, and a primitive with another primitive [5]. 4) Strategy SmallGP uses a generational strategy, with tournament selection. 5) Initialization The tree in the initial population are initialized with a size-safe "Grow"-Method. The probabilty of choosing a terminal at a certain depth is obtained as follows: |primitives| + |variables| + 1 P_term = ------------------------------ |variables| + 1 Constants are considered as one virtual terminal symbol. Due to the "Grow"-Method most of the individuals are very small in the initial population. This, in combination with the previously defined point mutation, can cause lack of structural innovation in some cases. So an additional parameter P_term_really is introduced that allows to scale between "Grow" and "Full" initialization. A terminal is set with the probability: P_term' = P_term * P_term_really . With P_term_really = 1 the method equals "Grow", with P_term_really = 0 "Full". The defaultvalue of Pterm is 1. V) References [1] Genetic Programming, Symbolic Regression http://www.genetic-programming.org/ [2] Tiny GP specification: http://cswww.essex.ac.uk/staff/sml/gecco/TinyGP.html [3] Polish Notation http://www.cenius.net/refer/display.php?ArticleID=polishnotation [4] Subtree Crossover http://www.geneticprogramming.com/Tutorial/#anchor179890 [5] Genetic Programming with One-Point Crossover and Point-Mutation (Poli, Langdon) http://cswww.essex.ac.uk/staff/poli/papers/Poli-WSC2-1997.pdf VI) License This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 US