Modeling with Formulas

Advanced features of the C++ language, such as overloaded constructors and operators, make it possible to create a new language from within C++ itself. The end product is of course still C++, but with enhanced semantics. In the Time Series API (TSA), almost all operators have been overloaded to work with TSA's native data types, extending C++'s capabilities in the time series domain.

TSA's native data types are called interface types because they represent an interface layer to the underlying component object model. Using interface data types together with overloaded operators and native API functions results in a new syntax, called formula syntax, which is similar to the syntax used in spreadsheet formulas.

TSA's basic data types are:

  • class Double
  • class Float
  • class Int
  • class Unsigned
  • class Long
  • class String
  • class Bool
  • class DateTime
  • class VoidPtr

Instances of the above types represent individual values. For each of the above types there also exists a corresponding vector type. Vector types are important when functions need to look at 'previous' values. Vector types will be covered shortly.

Most types correspond directly to primitive C/C++ types. For example, class Double corresponds to C++'s type 'double' (a 64 bit floating point number). The correspondence is not exact, however: class Long always corresponds to a 64 bit integral value, whereas classes Int and Unsigned are both always 32 bit. Class VoidPtr can be used to pass pointers to larger structures between components.

The following is a short example of how data types are used in a strategy:

 

 

Strategy s;

   Int    a = (Int)BarCount();
   Int    b = a * 2;
   Int    c = b * 3;
   Bool   d = !(a % 2);
   String e = String(a) + " is " + If(d, "'even'", "'odd'");

s.SetPeriod( 20030101, 20030115);

//'file' is an output stream opened elsewhere
while(s.EvaluateBar())
{
    file  << a << ", " << b << ", "
          << c << ", " << d << ", " << e << endl;
}

 


This strategy does very little other than print the values of the data objects at each strategy interval, producing the following output:

1, 2, 6, 0, 1 is 'odd'
2, 4, 12, 1, 2 is 'even'
3, 6, 18, 0, 3 is 'odd'
4, 8, 24, 1, 4 is 'even'
(...)

This short example strategy does, however, illustrate how models are built with the TSA library. Instances of data type classes (Double, String, Bool etc.), called data objects, are initialized by formulas, which are similar in concept to the formulas used in spreadsheet applications, except that no cell references are used. These formulas then appear to be evaluated repeatedly during strategy evaluation, as the strategy iterates over all time intervals (a.k.a. 'bars') between the start date and the end date. Time intervals are of 'daily' length by default, but can be of any other length, including 'variable' length.

Component Object Model

Formula code is written in a way that suggests that every line of code is evaluated repeatedly at every interval. While this is true for the logic (i.e. the underlying component structure) corresponding to each line of code, it is not true for the formula code itself, as is apparent from the absence of a looping statement around the formula code. The actual evaluation of the underlying component logic is triggered by iterative calls to the strategy's EvaluateBar() member, whereas the formula code itself is executed only once, in a process called translation.

The Illusion of Iterative Evaluation

 //Formula code start

   Int    a = (Int)BarCount();
   Int    b = a * 2;
   Int    c = b * 3;
   Bool   d = !(a % 2);
   String e = String(a) + " is " + If(d, "'even'", "'odd'");

 //Formula code end

Translation refers to the process of turning formula code into a ('tree like') component structure, and is somewhat similar to working with a 'just in time compiler'. TSA's data types thus represent a thin interface to the underlying component structure, and can be considered 'handles'.
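The split between one-time translation and per-bar evaluation can be illustrated with a small, self-contained sketch. It is not TSA's implementation: Node, IntHandle, barCount and evaluateBars are hypothetical names, and the component model is reduced to a tree of callbacks held through reference-counted handles.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a component: computes one value per bar.
struct Node {
    std::function<long()> eval;
};

// Hypothetical handle type: a thin, copyable reference to a component.
class IntHandle {
public:
    explicit IntHandle(std::function<long()> f)
        : node_(std::make_shared<Node>(Node{std::move(f)})) {}
    long value() const { return node_->eval(); }
private:
    std::shared_ptr<Node> node_;
};

// The overloaded operator does no arithmetic when the formula line runs.
// It only creates a new component bound to its input.
IntHandle operator*(const IntHandle& lhs, long rhs) {
    return IntHandle([lhs, rhs] { return lhs.value() * rhs; });
}

// Source component: reads the current bar number from shared state.
IntHandle barCount(const std::shared_ptr<long>& bar) {
    return IntHandle([bar] { return *bar; });
}

// Runs the "formula" once, then evaluates the resulting tree once per bar.
std::vector<long> evaluateBars(int bars) {
    assert(bars >= 0);
    auto bar = std::make_shared<long>(0);

    // Translation: these three lines execute exactly once.
    IntHandle a = barCount(bar);
    IntHandle b = a * 2;
    IntHandle c = b * 3;

    std::vector<long> out;
    for (int i = 0; i < bars; ++i) {
        ++*bar;                    // advance to the next bar
        out.push_back(c.value());  // only the component tree is re-evaluated
    }
    return out;
}
```

Calling evaluateBars(4) runs the three formula lines once and then evaluates the tree four times, yielding 6, 12, 18, 24, the same values as the third column of the output shown earlier.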

Such an architecture has a number of advantages, one of which is that it is very easy to encapsulate logic as functions that have the same 'look and feel' as native API functions. A function to add two numbers is simply written as:

Double Add(Double a, Double b){
    return a + b;
}

Writing functions is covered in more detail below.

Although formula code looks as if it is evaluated repeatedly, it really is not! Formulas are translated into a component structure (which involves creating and 'binding' components), and it is this component structure which does all the work during strategy evaluation. Formulas only represent a thin interface layer to the underlying component model. Consider the following lines of code from the earlier example:
 

Int a = (Int)BarCount();
Int b = a * 2;
Int c = b * 3;


Note that the BarCount() function returns a Long, a 64 bit integral value, which needs to be explicitly cast to a 32 bit Int to avoid a compiler error in this particular example. These three lines of code translate into the following component structure, where the oval shapes represent components, and the data objects (Int a, Int b, Int c) represent component outputs:

  

Each of the shapes in the diagram represents a particular component that performs a specific function. Some components perform simple arithmetic operations, while more complex components may engage in complex statistical analysis or access data from various sources. This component based structure is important as the native API library can be extended easily by implementing new components.

The Role of the Strategy Instance


A single strategy instance coordinates and orchestrates all components. This single strategy instance also contains many of the properties that control the simulation.

Although in the earlier example there appears to be no visible connection between the formula code and the strategy instance, every component created by formula code is ultimately owned and controlled by this single strategy instance, as every component registers itself with the strategy immediately upon construction via a 'hidden' strategy pointer. This 'hidden' strategy pointer is implemented using Thread Local Storage (TLS). Only a single strategy instance should thus exist per thread of execution at any one time!
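The registration mechanism can be sketched as follows. This is only an illustration of the idea under stated assumptions, not TSA's code: the Strategy and Component classes below are hypothetical stand-ins, and the registry is non-owning to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class Component;

// Hypothetical strategy: records the components created on its thread.
class Strategy {
public:
    Strategy() {
        // Only one strategy may be active per thread at any one time.
        assert(current_ == nullptr);
        current_ = this;
    }
    ~Strategy() { current_ = nullptr; }
    Strategy(const Strategy&) = delete;
    Strategy& operator=(const Strategy&) = delete;

    static Strategy* current() { return current_; }
    void add(Component* c) { components_.push_back(c); }
    std::size_t componentCount() const { return components_.size(); }

private:
    static thread_local Strategy* current_;  // the 'hidden' pointer
    std::vector<Component*> components_;     // non-owning in this sketch
};

thread_local Strategy* Strategy::current_ = nullptr;

// Every component registers itself with the current strategy on construction,
// so formula code never has to mention the strategy explicitly.
class Component {
public:
    Component() {
        assert(Strategy::current() != nullptr);
        Strategy::current()->add(this);
    }
};
```

With this arrangement, constructing a Strategy followed by two Component objects on the same thread leaves both registered with that strategy, without either constructor taking a strategy argument.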

Avoiding a visible relation between formula code and the strategy instance simplifies both formula code and the implementation of new formula functions. The implementation of formula functions is discussed below.

Working with Time Series

While the example discussed above demonstrated the use of simple operators on basic datatypes such as Int, Bool, and String, there has so far been no use of actual time series functions, such as a 'moving average' function. Time series functions require the ability to look 'back in time' at 'previous' values in the series. This ability to look back in time is made possible by a second set of data classes called vector classes. For every scalar type presented earlier there exists a complementary vector type. These are:

  • class DoubleVector
  • class FloatVector
  • class IntVector
  • class UnsignedVector
  • class LongVector
  • class StringVector
  • class BoolVector
  • class DateTimeVector
  • class VoidPtrVector

Instances of vector classes represent sequential arrays (ordered by datetime) of historical values of a given time series, where the value at index 0 corresponds to the current value, the value at index 1 represents 'yesterday's' value (assuming 'daily' interval length), and so forth.
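The indexing convention can be made concrete with a small sketch. The History class below is hypothetical and only mimics the 'index 0 is the current bar' rule; it says nothing about how TSA stores vector data internally.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical history buffer: index 0 is the current bar, 1 the previous one.
class History {
public:
    // Called once per bar with that bar's value.
    void push(double value) { values_.push_front(value); }

    // Looks back 'barsAgo' bars from the current one.
    double operator[](std::size_t barsAgo) const {
        assert(barsAgo < values_.size());  // not enough history otherwise
        return values_[barsAgo];
    }

    std::size_t size() const { return values_.size(); }

private:
    std::deque<double> values_;  // newest value at the front
};
```

After pushing 10, 11 and 12 on three consecutive bars, index 0 yields 12 (the current value), index 1 yields 11 and index 2 yields 10.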

Vector types are used primarily as function parameter types. Consider the declaration of the Ave() formula function:

         Double Ave(const DoubleVector& data, size_t period);

This function calculates an 'average' over a given period. This average is of course a 'moving average', since the average is calculated repeatedly at every time interval (bar) of the simulation. However, as a convention, the prefix 'moving' is dropped from all function names. Let's see how the Ave() function is used:


   MemTable mt;
   Strategy t;

   Double num = Rand();

   Double a3  = Ave( num, 3 );   //3 bar moving average
   Double a5  = Ave( num, 5 );
   Double a8  = Ave( num, 8 );
   Double a13 = Ave( num, 13 );

   ORecord( mt, //writing record to memory table
      List(num, a3.As("ave3"), a5.As("ave5"), a8.As("ave8"),
           a13.As("ave13") ) );

   t.Evaluate( D(2005, JAN, 1), D(2005, FEB, 25) );

   mt.PrintTable(file);

This example is straightforward. The Rand() function returns random numbers, and the Ave() function is called several times to calculate moving averages of different periods. The resulting records are then written to a MemTable (a memory based table that does not require a database file) and printed.

What appears odd, however, is that the argument passed to the Ave() function is not a vector! How can a 13 bar moving average be calculated from a single number? This is possible because scalar types can be cast to vector types on demand! Recall that the Ave() function declares its 'data' parameter as type DoubleVector.

Double Ave(const DoubleVector& data, size_t period);

The compiler will attempt to construct a DoubleVector from any data object passed as the 'data' argument, and constructing a DoubleVector from a Double is not a problem. This works because formula code is evaluated only once, and the subsequent iterative evaluation of each strategy interval is actually performed by the component structure derived from the formula code during the translation process. This means that all vector allocations actually occur before strategy evaluation begins.
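Because the period is known when the formula is translated, a component can fix the size of its history window up front and simply produce no output until enough bars have been seen. The following is a minimal sketch of that idea; RollingAverage is a hypothetical class and should not be read as the implementation of Ave().

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical rolling-average component with a fixed lookback period.
class RollingAverage {
public:
    explicit RollingAverage(std::size_t period) : period_(period) {
        assert(period > 0);
    }

    // Feeds the value of the current bar into the window.
    void push(double value) {
        window_.push_back(value);
        sum_ += value;
        if (window_.size() > period_) {  // drop the value that fell out
            sum_ -= window_.front();
            window_.pop_front();
        }
    }

    // True once 'period' bars have been seen (the lookback requirement).
    bool ready() const { return window_.size() == period_; }

    // Average of the last 'period' values; only valid when ready().
    double value() const {
        assert(ready());
        return sum_ / static_cast<double>(period_);
    }

private:
    std::size_t period_;
    std::deque<double> window_;
    double sum_ = 0.0;
};
```

A RollingAverage(3) fed 1, 2 and 3 is not ready after the first two values and reports 2 after the third, which is the same kind of start-up delay a lookback requirement introduces.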

In the strategy output below, note how the first visible record is listed for Jan 20th, when the strategy's start date was actually set as Jan 1st. This delay is caused by the 'lookback' requirement of the calls to the Ave() function. The longest period used was 13, and since weekend dates are automatically skipped, it takes until Jan 20th for the first full record to be submitted:

  __________________________________________________________________________
 | TIMESTAMP   | VALUE     | AVE3      | AVE5      | AVE8      | AVE13     |
 |_____________|___________|___________|___________|___________|___________|
 | 2005-Jan-20 | 0.1134087 | 0.5628002 | 0.6338308 | 0.5623603 | 0.5854806 |
 | 2005-Jan-21 | 0.5251174 | 0.4391819 | 0.5484886 | 0.5338463 | 0.5799172 |
 | 2005-Jan-24 | 0.3496782 | 0.3294014 | 0.5126392 | 0.5634396 | 0.5384965 |
 | 2005-Jan-25 | 0.2635203 | 0.3794386 | 0.3861488 | 0.5384337 | 0.5137234 |
 | 2005-Jan-26 | 0.4882196 | 0.3671394 | 0.3479888 | 0.4804826 | 0.5459061 |
 |_____________|___________|___________|___________|___________|___________|
 | 2005-Jan-27 | 0.1941596 | 0.3152998 | 0.3641390 | 0.4386369 | 0.4861213 |
 | 2005-Jan-28 | 0.7290405 | 0.4704732 | 0.4049636 | 0.4177705 | 0.4842607 |
 | 2005-Jan-31 | 0.8198903 | 0.5810301 | 0.4989661 | 0.4353793 | 0.5386421 |
 | 2005-Feb-01 | 0.1336738 | 0.5608682 | 0.4729968 | 0.4379125 | 0.5132657 |
 | 2005-Feb-02 | 0.7348643 | 0.5628094 | 0.5223257 | 0.4641308 | 0.4965761 |
 |_____________|___________|___________|___________|___________|___________|
 | TIMESTAMP   | VALUE     | AVE3      | AVE5      | AVE8      | AVE13     |
 |_____________|___________|___________|___________|___________|___________|
 | 2005-Feb-03 | 0.2875530 | 0.3853637 | 0.5410044 | 0.4563652 | 0.4780090 |
 | 2005-Feb-04 | 0.7759735 | 0.5994636 | 0.5503910 | 0.5204218 | 0.4687784 |
 | 2005-Feb-07 | 0.6663082 | 0.5766116 | 0.5196746 | 0.5426829 | 0.4678006 |
 | 2005-Feb-08 | 0.5783674 | 0.6735497 | 0.6086133 | 0.5907089 | 0.5035666 |
 | 2005-Feb-09 | 0.4552840 | 0.5666532 | 0.5526972 | 0.5564893 | 0.4981948 |

 

  ( . . .) 

Also note that the first column contains timestamp values. Timestamps are innate to all table schemas and to the strategy object itself. Every strategy interval has its own unique timestamp which is automatically recorded in every table or file to which model data is written.

Endless Simulations

All components that are part of a strategy are synchronized and orchestrated by the single strategy instance and are always evaluated one bar at a time. This is why the length of an evaluation does not need to be known in advance! It also means that models can be deployed on real time data for as long as required, with up to 1000 evaluations of the model per second. The highest supported timestamp is Dec 31 of the year 99,999!

Live data can be fed directly into a model, via special formula functions that read data from variables external to the model.

Writing Functions using Formula Syntax

Recall that TSA data types represent a thin layer of abstraction. They are in fact 'handles' to the underlying (reference counted) component objects. One advantage of this architecture is that implementing functions is greatly simplified. Data objects, and even collections such as objects of type Tuple, can simply be returned 'by value' from functions without concern for the overhead incurred.

Consider these formula function implementations:


Bool CrossesAbove(const DoubleVector& a, const DoubleVector& b){
    return ( (a > b) && (a[1] <= b[1]) );
}

 
The CrossesAbove() function returns true if 'a' is larger than 'b' on the current bar but was not larger than 'b' on the previous bar. Note that the return value of type Bool is returned 'by value', whereas the function parameters of type DoubleVector are declared as constant references. This convention is however purely for semantic reasons, as vector types can also be passed simply 'by value'.
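The same test can be written against plain arrays of history to make the per-bar logic explicit. This is a sketch only: crossesAbove below is a hypothetical free function operating on std::vector, with index 0 taken as the current bar and index 1 as the previous bar.

```cpp
#include <cassert>
#include <vector>

// 'a' and 'b' hold history with index 0 as the current bar and index 1 as
// the previous bar, matching the vector indexing described earlier.
bool crossesAbove(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() >= 2 && b.size() >= 2);  // needs one bar of lookback
    const bool aboveNow       = a[0] > b[0];
    const bool notAboveBefore = a[1] <= b[1];
    return aboveNow && notAboveBefore;
}
```

The function is true only on the bar where 'a' moves from at-or-below 'b' to above it; it is false while 'a' stays above 'b' on both bars.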

Defining Sequences


Long BarCount()
{
    LongVector v(0); //sequence 'seed' set to 0
    v = Prev(v) + 1;
    return v;
}

 
The above BarCount() function demonstrates the use of sequences. Sequences are a feature of vector types that allows the current value of the vector to be defined as a function of previous values in the same vector. Sequences allow formula code to maintain 'state' between strategy intervals! The above example simply returns a sequence starting at 1 (one larger than the seed value), incremented by 1 at each interval.
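The idea of a value defined in terms of its own previous value can be sketched outside of TSA as follows. Sequence and makeBarCount are hypothetical names, and the sketch only tracks a single previous value rather than a full history.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical sequence: the current value is a function of the previous one.
class Sequence {
public:
    Sequence(long seed, std::function<long(long)> step)
        : value_(seed), step_(std::move(step)) {
        assert(static_cast<bool>(step_));  // a step rule is required
    }

    // Advances to the next bar and returns the new current value.
    long next() {
        value_ = step_(value_);
        return value_;
    }

private:
    long value_;                      // state carried between bars
    std::function<long(long)> step_;  // rule applied once per bar
};

// Mirrors BarCount(): seed 0, each bar is the previous value plus one.
Sequence makeBarCount() {
    return Sequence(0, [](long prev) { return prev + 1; });
}
```

Successive calls to next() on the object returned by makeBarCount() yield 1, 2, 3 and so on, one value per bar.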

Larger Functions


Double Correlation(const DoubleVector& x, const DoubleVector& y, size_t period)
{
    Double sumX  = Sum(x, period);
    Double sumY  = Sum(y, period);
    Double sumXY = Sum( x * y, period);
    Double sumX2 = Sum(Pow(x,2), period);
    Double sumY2 = Sum(Pow(y,2), period);

    Double dividend = ((Long)period * sumXY) - (sumX * sumY);
    Double divisor  = SqRt(((Long)period * sumX2
                - Pow(sumX,2))*((Long)period * sumY2 - Pow(sumY,2)));
    return dividend / divisor;  
}

 
This Correlation function calculates the correlation coefficient between 'x' and 'y' over the given period. There is no limit to the amount of code and complexity that can be encapsulated by a single function!
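For comparison, the same arithmetic applied to a single window of plain numbers looks like this. The function name correlation and the use of std::vector are choices made for this sketch; only the formula itself is taken from the Correlation() function above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation of two equally sized windows, using the same running
// sums as the Correlation() formula function.
double correlation(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double n = static_cast<double>(x.size());

    double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumX  += x[i];
        sumY  += y[i];
        sumXY += x[i] * y[i];
        sumX2 += x[i] * x[i];
        sumY2 += y[i] * y[i];
    }

    const double dividend = n * sumXY - sumX * sumY;
    const double divisor  = std::sqrt((n * sumX2 - sumX * sumX) *
                                      (n * sumY2 - sumY * sumY));
    // Not meaningful if either window is constant (the divisor is zero).
    return dividend / divisor;
}
```

For x = {1, 2, 3, 4} and y = {2, 4, 6, 8} the dividend and divisor are both 40, giving a coefficient of 1; reversing y gives -1.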

Returning a Tuple


Tuple BollingerBands(const DoubleVector& _high, const DoubleVector& _low,
    const DoubleVector& _close , size_t _period, double _numStdDev)
{
    Double typicalPrice   = Ave( List(_high, _low, _close) );
    Double shiftBy        = StDev(typicalPrice, _period) * _numStdDev;
    Double typicalPriceMA = Ave(typicalPrice, _period);
    Double upperBand      = typicalPriceMA + shiftBy;
    Double lowerBand      = typicalPriceMA - shiftBy; 
    Tuple res = ListToTuple(
                    List( upperBand.SetName("Upper"),
                          lowerBand.SetName("Lower"),            
                          typicalPriceMA.SetName("Middle")) );
    return res;
}

 
This BollingerBands() function demonstrates how a function can return multiple values in the form of a Tuple object. A tuple derives its name from the suffix of words such as quintuple and sextuple, which denote sets of multiple values.

(...)
Tuple bb = BollingerBands(high, low, close, 20, 1.8);

Double upper  = bb("Upper");  //access 'by name', as set via SetName()
Double lower  = bb("Lower");
Double middle = bb("Middle");
(...)

Returning multiple values as a tuple is possible due to the lightweight nature of interface type classes, to which both class Tuple and class List belong. Lists are usually used as function parameter types, whereas tuples are almost exclusively used as function return values.

The difference between the two classes is that a tuple's size is fixed, and each of its value objects has a unique name, whereas data objects stored in a list object do not require unique names. Data object names are used primarily to retrieve values from a collection 'by name' (as shown above), as well as to automatically derive table schema field names and column header names when printing collections to stream.
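The 'fixed size, unique names' rule can be illustrated with a small stand-alone class. NamedTuple is a hypothetical type written for this sketch; it assumes exact, case-sensitive name matching, which may differ from how TSA's Tuple compares names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fixed-size collection whose values have unique names.
class NamedTuple {
public:
    explicit NamedTuple(std::vector<std::pair<std::string, double>> items)
        : items_(std::move(items)) {
        // Names must be unique; checked once, at construction time.
        for (std::size_t i = 0; i < items_.size(); ++i)
            for (std::size_t j = i + 1; j < items_.size(); ++j)
                assert(items_[i].first != items_[j].first);
    }

    // Access 'by name'; throws if the name is unknown.
    double operator()(const std::string& name) const {
        for (const auto& item : items_)
            if (item.first == name) return item.second;
        throw std::out_of_range("no value named " + name);
    }

    // The size never changes after construction.
    std::size_t size() const { return items_.size(); }

private:
    std::vector<std::pair<std::string, double>> items_;
};
```

A tuple built from the pairs ("Upper", 105), ("Lower", 95) and ("Middle", 100) has size 3 and returns 105 when accessed with the name "Upper".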


     Copyright (c) 2008 PERITECH. All Rights Reserved.