Context refactoring #181

Open
neil-tan opened this issue Sep 16, 2019 · 19 comments
@neil-tan
Member

Draft for context changes

  • Operators on stack
  • Tensor name and pointer mapping for easier operator porting
  • Easier breakpoint setting
  • Extensibility for tensor object-pooling, persistent operators, offline-lifecycle planner for tensor object

Things to consider:

  • Would the template usage here increase the binary size significantly?
  • Consider making our own hash-tables, or some form of tensor-index planning ability

Code snippet below:

typedef utensor::string OpName;
typedef std::unordered_map<TName, Tensor*> TensorNamePtrMap;

struct TensorRecord {
    public:
        unsigned char ref_count = 0;
        Tensor* t = nullptr;
        //bool keep_alive = false;
        TensorRecord() = default; //needed: unordered_map::operator[] default-constructs records
        TensorRecord(Tensor* _t, unsigned char _ref_count) :
            t(_t), ref_count(_ref_count) {}
};

class Context {
  private:
      std::unordered_map<TName, TensorRecord> tTable;
      //TODO: - Op data manager
      //      - Profiler

  public:
    Context();
    template<class T, typename... Args>
    Tensor* add(TName _name, unsigned char _ref_count, Args&&... args) {
        //pooling can be implemented here
        Tensor* t = new T(std::forward<Args>(args)...);
        t->setName(_name);
        tTable[_name] = TensorRecord(t, _ref_count);
        return t;
    }

    //Tensor lookup interface
    //non-existing tensor: returns Tensor*& but Tensor* is null
    Tensor*& operator[](TName name) {
      //TODO: define behavior for tensor-not-found
      Tensor*& t = tTable[name].t;
      return t;
    };
    
    void invoke(Operator* op, OpName _name = 0);  //persistent ops exist on the heap

    //intermediate ops exist on the stack
    void invoke(Operator& op, OpName _name = 0) {
      //trigger registered actions based on _name here
      op.compute();
    }

    //This tensor removal function is meant to be called by code-gen directly
    void rm(TName t_name) {
      //TODO: check for t_name's existence
      delete tTable[t_name].t;
      tTable.erase(t_name);
    }
    
    //decrease ref count of used tensors and perform deletion
    //NT: Template based function worries me, as one copy of the function may be generated per op use
    template <class T>
    void gc(T &t_struct) {
      //treat the op's input/output struct as a flat array of Tensor*
      for (unsigned char i = 0; i < (sizeof(T) / sizeof(Tensor*)); i++) {
        TName t_name = ((Tensor**) &t_struct)[i]->name;
        unsigned char c = tTable[t_name].ref_count - 1;
        if(c == 0) {  //ref_count is unsigned, so compare against zero
          rm(t_name);
        } else {
          tTable[t_name].ref_count = c;
        }
      }
    }

};

//Code for operators 

//base class for all operators; defined first so MatMulOp below can derive from it
class Operator : public uTensor {
public:
  virtual void compute() = 0;
};

template <class T1, class T2, class TOut>
class MatMulOp : public Operator {
  public:
    struct {
      Tensor* input;
      Tensor* dim;
    } inputs;

    struct {
      Tensor* output;
    } outputs;

  MatMulOp() {
    //similar to TFLM's prepare function
  }
  virtual void compute() override {
    MatMul2<T1, T2, TOut>(inputs.input, inputs.dim,
     outputs.output);
  }
};


//// Example for Generated Code

//Old
{
    RamTensor<float>* out_tensor;
    out_tensor = new RamTensor<float>({ 1 });
    ctx.add(out_tensor, "MatMul_eightbit/x__port__0/min:0", 1);
    ctx.push(new MinOp(), 
             { "MatMul_eightbit/x__port__0/reshape:0", "MatMul_eightbit/x__port__0/reduction_dims:0" },
             { "MatMul_eightbit/x__port__0/min:0" });
    ctx.eval();
}

//New

{
    //Keeping one-off operators on the stack
    //Stateful operators can be allocated on the heap or a more global scope
    MinOp op;  //note: "MinOp op();" would declare a function (most vexing parse)
    op.inputs.input = ctx["MatMul_eightbit/x__port__0/reshape:0"];
    op.inputs.dim = ctx["MatMul_eightbit/x__port__0/reduction_dims:0"];
    //ctx.add() registers Tensor* with the context and returns the same Tensor*
    //Tensor instantiation now all happens in one place, paving the way for instance pooling
    op.outputs.output = ctx.add<RamTensor<float>>("MatMul_eightbit/x__port__0/min:0", 1, { 1 });

    //setting a breakpoint here should take you (almost) straight to the kernel
    ctx.invoke(op, "op_name_from_code_gen"); //with or without supplying the name

    //logic for reference-counting clean up
    ctx.gc(op.inputs);  //or, use code-gen to delete all the input tensors by calling rm()
}
@mbartling
Member

FYI, this will not compile for more than one type T unless the typename args are all different types. You can get around it by appending an optional type at the end, T x = std::get_default<T>():

  Tensor* add(TName _name, unsigned char _ref_count, Args&&... args) {

The input pointers to operators are really scary, especially if the tensors are stored in a destructive data structure like an arena. These will have to be checked prior to reading to make sure they are resident/readable/writeable.

    struct {
      Tensor* input;
      Tensor* dim;
    } inputs;

    struct {
      Tensor* output;
    } outputs;
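
To make the last point concrete, here is a hypothetical sketch of checking an operator's input pointers before reading them, as would be needed if tensors live in a destructive arena. The helper names (is_readable, checked_compute) are illustrative only, not part of uTensor:

#include <cassert>

struct Tensor { bool resident = true; /* stand-in for the real tensor type */ };

//hypothetical helper: is the tensor still mapped/readable in the arena?
inline bool is_readable(const Tensor* t) {
  return t != nullptr && t->resident;
}

struct MatMulInputs { Tensor* input; Tensor* dim; };

void checked_compute(const MatMulInputs& in) {
  assert(is_readable(in.input) && is_readable(in.dim));  //check before touching the data
  // ... kernel would run here ...
}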

@mbartling
Member

I didn't have a lot of time, but here's a refactored draft proposal:

https://github.com/uTensor/uTensor/blob/proposal/new_interface/proposals/massive_context_refactor.cpp

@dboyliao
Member

@mbartling I don't get it.
Why won't @neil-tan's implementation compile unless typename Args... are all different types?
I don't see such a constraint on variadic templates.

@neil-tan
Member Author

It currently compiles and works with the code below. But I can dig deeper:

class Factory {
    public:
    Factory() {printf("factory created\r\n");}
    template<class T, class T2, typename... Args>
    T* create(T2&& n, Args&&... args) {
        printf("create with integer: %d\r\n", forward<T2>(n));
        return new T(std::forward<Args>(args)...);
    }
};

@mbartling
Member

@dboyliao typename Args... generally expands to T1 arg1, T2 arg2, ..., so it's possible, depending on the context, to get duplicate function definitions. The major point is that the input type T is not actually a parameter in the function signature Tensor* add(TName _name, unsigned char _ref_count, Args&&... args) {
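
A minimal standalone sketch of this point (the types below are stand-ins, not the real uTensor classes): T never appears in the parameter list, so it cannot be deduced and must be spelled out at every call site, and each distinct <T, Args...> combination instantiates another copy of add<>():

#include <utility>

struct Tensor { virtual ~Tensor() {} };
struct RamTensorStub : Tensor { RamTensorStub(int) {} };
struct RomTensorStub : Tensor { RomTensorStub(int) {} };

template <class T, typename... Args>
Tensor* add(const char* name, unsigned char ref_count, Args&&... args) {
  (void)name; (void)ref_count;
  return new T(std::forward<Args>(args)...);  //T is only used here, never deduced
}

int main() {
  // add("x", 1, 4);                          //error: T cannot be deduced from the arguments
  Tensor* a = add<RamTensorStub>("x", 1, 4);  //instantiates add<RamTensorStub, int>
  Tensor* b = add<RomTensorStub>("y", 1, 4);  //second instantiation: add<RomTensorStub, int>
  delete a; delete b;
  return 0;
}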

@dboyliao
Member

Oh, I see.
That is an issue, really.

@neil-tan
Member Author

neil-tan commented Sep 20, 2019

An updated draft; see minor changes in the code snippet.
The intention is to incorporate some ideas proposed by @mbartling in his draft above.
In his proposal, primarily:

  • Tensor-handle
  • Tensor-allocator
  • operator input-output builder/mapper

The updated proposal here addresses these points:

  • While a tensor-handle would enable more flexibility in terms of runtime memory management and act as a layer of defense when programming with raw pointers, developers and users are unlikely to work with pointers directly given the aid of code-gen. Offline memory planning should be adequate for most use-cases. With the updated proposal, a tensor-handle can be added in the future if necessary.
  • The tensor-allocator will be incorporated. A custom allocator enables more predictable memory patterns in a constrained system. The factory pattern will be removed from the add method in favor of accepting raw pointers; this is intended to minimize the use of template functions.
  • The operator input/output map won't be implemented in the current draft. In-class anonymous input/output structs provide a similar level of readability and require fewer lookups.

Most modifications are around the code relating to Tensor* add(...)

typedef utensor::string OpName;
typedef std::unordered_map<TName, Tensor*> TensorNamePtrMap;

struct TensorRecord {
    public:
        unsigned char ref_count = 0;
        Tensor* t = nullptr;
        //bool keep_alive = false;
        TensorRecord() = default; //needed: unordered_map::operator[] default-constructs records
        TensorRecord(Tensor* _t, unsigned char _ref_count) :
            t(_t), ref_count(_ref_count) {}
};

class Context {
  private:
      std::unordered_map<TName, TensorRecord> tTable;
      //TODO: - Op data manager
      //      - Profiler
      //      - Allocator

  public:
    TensorAllocator tensor_allocator;  //note: a static member cannot be initialized in a constructor initializer list
    Context(TensorAllocator t_alloc) : tensor_allocator(t_alloc) {};
    //TODO: modifying this to support pooling
    Tensor* add(Tensor* t, TName _name, unsigned char _ref_count = 1) {
        t->setName(_name);
        tTable[_name] = TensorRecord(t, _ref_count);
        return t;
    }

    //Tensor lookup interface
    //non-existing tensor: returns Tensor*& but Tensor* is null
    Tensor*& operator[](TName name) {
      //TODO: define behavior for tensor-not-found
      Tensor*& t = tTable[name].t;
      return t;
    };
    
    void invoke(Operator* op, OpName _name = 0);  //persistent ops exist on the heap

    //intermediate ops exist on the stack
    void invoke(Operator& op, OpName _name = 0) {
      //trigger registered actions based on _name here
      op.compute();
    }

    //This tensor removal function is meant to be called by code-gen directly
    void rm(TName t_name) {
      //TODO: check for t_name's existence
      delete tTable[t_name].t;
      tTable.erase(t_name);
    }
    
    //decrease ref count of used tensors and perform deletion
    //NT: Template based function worries me, as one copy of the function may be generated per op use
    template <class T>
    void gc(T &t_struct) {
      //treat the op's input/output struct as a flat array of Tensor*
      for (unsigned char i = 0; i < (sizeof(T) / sizeof(Tensor*)); i++) {
        TName t_name = ((Tensor**) &t_struct)[i]->name;
        unsigned char c = tTable[t_name].ref_count - 1;
        if(c == 0) {  //ref_count is unsigned, so compare against zero
          rm(t_name);
        } else {
          tTable[t_name].ref_count = c;
        }
      }
    }

};

//Code for operators 

//base class for all operators; defined first so MatMulOp below can derive from it
class Operator : public uTensor {
public:
  virtual void compute() = 0;
};

template <class T1, class T2, class TOut>
class MatMulOp : public Operator {
  public:
    struct {
      Tensor* input;
      Tensor* dim;
    } inputs;

    struct {
      Tensor* output;
    } outputs;

  MatMulOp() {
    //similar to TFLM's prepare function
  }
  virtual void compute() override {
    MatMul2<T1, T2, TOut>(inputs.input, inputs.dim,
     outputs.output);
  }
};


//// Example for Generated Code

//Old
{
    RamTensor<float>* out_tensor;
    out_tensor = new RamTensor<float>({ 1 });
    ctx.add(out_tensor, "MatMul_eightbit/x__port__0/min:0", 1);
    ctx.push(new MinOp(), 
             { "MatMul_eightbit/x__port__0/reshape:0", "MatMul_eightbit/x__port__0/reduction_dims:0" },
             { "MatMul_eightbit/x__port__0/min:0" });
    ctx.eval();
}

//New

{
    //Keeping one-off operators on the stack
    //Stateful operators can be allocated on the heap or a more global scope
    MinOp op;  //note: "MinOp op();" would declare a function (most vexing parse)
    op.inputs.input = ctx["MatMul_eightbit/x__port__0/reshape:0"];
    op.inputs.dim = ctx["MatMul_eightbit/x__port__0/reduction_dims:0"];
    //ctx.add() registers Tensor* with the context and returns the same Tensor*
    op.outputs.output = ctx.add(new RamTensor<float>({ 1 }), "MatMul_eightbit/x__port__0/min:0", 1);

    //setting a breakpoint here should take you (almost) straight to the kernel
    ctx.invoke(op, "op_name_from_code_gen"); //with or without supplying the name

    //logic for reference-counting clean up
    ctx.gc(op.inputs);  //or, use code-gen to delete all the input tensors by calling rm()
}

@mbartling
Member

mbartling commented Sep 20, 2019

@neil-tan "While tensor-handle would enable more flexibility in-terms of runtime memory management and act as a layer of dense while programming with raw-pointers, developers and users are unlikely to work with pointers directly with the aid of code-gen. Offline memory planning should be adequate for most use-cases. With the update proposal, tensor-handle can be added in the future if necessary." False, literally all of our operators currently do something like
uint8_t* x = t->read(0,0);
then process with x. Do not ever assume people will not use something that is accidentally exposed. And exposing raw pointers is extraordinarily dangerous and should not be taken lightly. I am not convinced that offline memory planning can handle enough use cases (large enough useful graphs) to warrant not including a method for runtime memory management. Especially if you consider the use-case where models are run asynchronously in response to some external event (on picture capture, on microphone listen, etc.).

@neil-tan
Member Author

@mbartling To reiterate, the raw tensor-pointers are exposed within the uTensor code-base itself. Application developers and the majority of uTensor users will not be interacting with tensors directly (upcoming user-api changes) nor dealing with operators. The contributors and core developers, though, will have to deal with operators and tensor pointers; they need to know how to work with raw pointers anyway.
We need to be very specific about what uTensor is. Given the resources we have, I am hesitant to cover anything more than Cortex-M at the moment. There are advantages in keeping things simple and stupid. Future-proofing is good. However, unless we can utilize these functions now, overengineering pays a tax in complexity.
For larger graphs, tools like Tensorflow Lite Micro and Arm-NN exist. We should not reinvent the wheel. Instead, we should focus on design decisions that will make uTensor stand out.
The core issue here seems to be whether we should support application processors. To which I'm inclined to say: not a priority at the moment. However, for such questions relating to a broader road-map, we can discuss offline if desired.

@mbartling
Member

Assumptions:

  1. Given a model, a memory plan, a dedicated memory allocator, and no external factors, subsequent model evals will result in the same sequence of allocations.
  2. Two independent models, each with their own memory plan, dedicated memory allocator, and no external factors, will likewise have the same sequence of allocations on each memory allocator for subsequent calls to model eval.

Consequence: independent model evals maintain deterministic behavior of operation and therefore have a plannable dataflow graph at codegen.

  3. Dependent actions, such as branching or one model's execution path depending on the evaluation of another model, impose non-deterministic, probabilistic path execution on the graph. Therefore, two subsequent calls to model eval may result in two completely different memory allocation traces. At best, codegen can estimate the likelihood of taking a particular branch given some knowledge of data distributions. This is a learning problem, but practically it doesn't scale from a tooling perspective.

Consequence: The runtime allocator should be able to operate on best-effort goals. That is, it should attempt to use the offline memory plan, since it can do this very quickly. However, it should also be able to handle error/edge cases such as fragmentation, attempt to resolve them internally or notify the user application of the error, and allow the application to grow, shrink, or simply fault on exceptions.
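
A rough sketch of that best-effort behavior, assuming a simple byte-granular arena; the class and member names here are illustrative only and not part of the current uTensor API:

#include <cstddef>
#include <cstdint>

enum class AllocStatus { Ok, UsedFallback, OutOfMemory };

class BestEffortAllocator {
 public:
  //try the offline-planned offset first; fall back to a runtime first-fit scan
  void* allocate(std::size_t size, std::size_t planned_offset, AllocStatus& status) {
    if (planned_offset + size <= kArenaSize && is_free(planned_offset, size)) {
      mark_used(planned_offset, size);
      status = AllocStatus::Ok;
      return arena_ + planned_offset;
    }
    //edge case (fragmentation, a second asynchronous model, etc.): best effort
    for (std::size_t off = 0; off + size <= kArenaSize; ++off) {
      if (is_free(off, size)) {
        mark_used(off, size);
        status = AllocStatus::UsedFallback;
        return arena_ + off;
      }
    }
    status = AllocStatus::OutOfMemory;  //let the application grow, shrink, or fault
    return nullptr;
  }

 private:
  static constexpr std::size_t kArenaSize = 4096;
  bool is_free(std::size_t off, std::size_t size) const {
    for (std::size_t i = off; i < off + size; ++i)
      if (used_[i]) return false;
    return true;
  }
  void mark_used(std::size_t off, std::size_t size) {
    for (std::size_t i = off; i < off + size; ++i) used_[i] = true;
  }
  std::uint8_t arena_[kArenaSize] = {};
  bool used_[kArenaSize] = {};
};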

This naturally leads to Tensor handles. A Tensor handle has the following properties:

  • A tensor handle is bound to the underlying data/meta data of a tensor

    • If the underlying data is updated, this knowledge is propagated to all "copies" (not actually copies) of the tensor handle
    • To save on space, the tensor handles are never actually copied, rather all are required to be passed by reference or behave like references
    • Special case: temporaries, e.g. swap(a, b): tmp = a; a = b; b = tmp;
  • Tensor handles are used as the high-level interface to all tensor types, and handle the delegation of polymorphic behavior without imposing any extra requirements on the users. This interface should feel natural and object-like.

Example)

Tensor a = RomTensor<uint8_t>(a_data, a_shape);
Tensor b = RomTensor<uint8_t>(b_data, b_shape);
Tensor c = RamTensor<uint8_t>(max(a_shape, b_shape));
// Assume this is a 4x4x4 tensor
c = Add<uint8_t>(a, b);
Tensor c2 = c; // c2 is bound to the same data as c, deep copy must happen explicitly

// Read interface
std::cout << c(1, 2, 3) << std::endl;

// write interface
c(1,2,3) = 10;

std::cout << c2(1,2,3) << std::endl; // outputs 10;

// Check to see if tensor is still accessible
if(c) // if not null
 foo();
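
A minimal sketch of the binding semantics in the example above, under the assumption that a handle only stores a pointer to shared underlying data. TensorData and TensorHandle here are placeholders, not the proposed classes:

#include <cstdint>
#include <iostream>

struct TensorData {                  //stand-in for the underlying tensor storage
  std::uint8_t values[4 * 4 * 4] = {};
  std::uint8_t& at(int i, int j, int k) { return values[(i * 4 + j) * 4 + k]; }
};

class TensorHandle {
 public:
  explicit TensorHandle(TensorData* data = nullptr) : data_(data) {}
  //copying the handle shares the same underlying data (no deep copy)
  std::uint8_t& operator()(int i, int j, int k) { return data_->at(i, j, k); }
  explicit operator bool() const { return data_ != nullptr; }

 private:
  TensorData* data_;
};

int main() {
  TensorData storage;
  TensorHandle c(&storage);
  TensorHandle c2 = c;                      //c2 is bound to the same data as c
  c(1, 2, 3) = 10;
  std::cout << int(c2(1, 2, 3)) << "\n";    //prints 10
  if (c) { /* tensor is still accessible */ }
  return 0;
}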

@neil-tan
Member Author

Point #3 is taken: branching and non-deterministic simultaneous graph evaluation calls.

With a runtime allocator, it's possible that tensors will be shuffled around in memory in response to system state. There are two ways to facilitate this:
**Tensor-handles**

  • The tensor-handle object points to (wraps) the actual tensor object
  • Tensor-handles are always passed/shared as references
  • Updating the tensor-handle therefore updates all the references to the actual tensor object

**Tensor-lookups**

  • A tensor-lookup class accepts a utensor::string name and returns the pointer to the actual tensor object
  • The class can dynamically change the binding between the name and the managed tensor objects
  • Operators look up their tensors right before evaluation (see the sketch after this list)
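
A hypothetical sketch of the tensor-lookup alternative (std::string stands in for utensor::string, and the class name is made up): operators resolve names just before compute(), so the context remains free to rebind a name to a relocated tensor in between:

#include <string>
#include <unordered_map>

struct Tensor { /* ... */ };

class TensorLookup {
 public:
  //resolve a name to the currently bound tensor object (nullptr if absent)
  Tensor* operator[](const std::string& name) {
    auto it = table_.find(name);
    return it == table_.end() ? nullptr : it->second;
  }
  //the context may rebind a name when the managed tensor object moves
  void rebind(const std::string& name, Tensor* t) { table_[name] = t; }

 private:
  std::unordered_map<std::string, Tensor*> table_;
};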

@mbartling We should keep the syntax simple. I prefer not explicitly making pass-by-reference a requirement for tensor-handles. How would you implement this? Tensor c2 = c;

@Knight-X

@mbartling
Member

"We should keep the syntax simple. I prefer not explicitly making pass-by-reference a requirement for tensor-handles. How would you implement this? Tensor c2 = c;"
-> We have two options here, either register each copy tensor handle in the allocator (silly), or construct a singleton type for the "reference" that makes it behave like pass by reference.

@neil-tan
Member Author

@mbartling I tried to monkey around with CopyTest& operator= (CopyTest& obj). Can't achieve the syntax we wanted with it.

  1. Register each copy tensor handle in the allocator
  • nah
  2. Singleton type
  • Are you defining a new class for every handle, or something else?

@mbartling
Member

Can you give me some requirements for the user interface here? I much prefer that we only have one allocation of the Tensor handle, with shared references. It makes cleanup 10000x easier.

@Knight-X
Member

> Point #3 is taken: branching and non-deterministic simultaneous graph evaluation calls. […]

I think that an update to a tensor ptr should also be propagated to the other tensors that reference it.

@neil-tan
Member Author

One common theme around the syntax in both examples, op.inputs.input = ctx["MatMul_eightbit/x__port__0/reshape:0"]; and Tensor c2 = c;, is that each involves only a single assignment operator. Keeping this convention helps us keep the codebase clean and consistent.

I think what @Knight-X is suggesting is having references to pointers. That would probably work?

@mbartling, there are cases to be made for runtime allocation (multiple inferences and a shared tensor-arena). Though dynamically moving tensors after they are allocated may not be an immediate requirement. My view is that the tensor-handle is OK, given that we can meet the requirement mentioned above. Otherwise, I think the complexity outweighs the benefit at this point.
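
A tiny illustration of the "references to pointers" idea (an assumption, not a committed design): rebinding the pointer through one alias is visible everywhere the reference is held:

#include <cassert>

struct Tensor { int id; };

int main() {
  Tensor a{1}, b{2};
  Tensor* slot = &a;       //the "actual tensor object" binding owned by the Context
  Tensor*& handle = slot;  //a handle is a reference to that pointer
  handle = &b;             //the allocator moves/rebinds the tensor...
  assert(slot->id == 2);   //...and every holder of the reference sees the update
  return 0;
}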

@mbartling
Member

@neil-tan TensorHandles are references to pointers XD

As for op.inputs.input = ctx["MatMul_eightbit/x__port__0/reshape:0"]; and Tensor c2 = c;:

That's easy as long as we differentiate between a tensor handle copy and an "original". The easiest solution here is to maintain a union pointer to either a raw tensor (mutable) or an internal reference to a TensorHandle& (which by language nature cannot change). Then the logic is simply a dereference call: if the data field is a TensorHandle, it calls the TensorHandle's dereference; otherwise it dereferences the actual data. Worst case, you do 3 cheap function calls.

// Tensors also appear on the same heap as the Tensor metadata. This way we can move tensors around and delete them without affecting user code
//template <typename Allocator=utensor::DefaultTensorMetaDataAllocator>
enum class TensorDataFieldType {
  TensorType, TensorInterfaceType
};
class Tensor {
    private:
        TensorDataFieldType type;
        // a union cannot hold a reference, so the "handle copy" case is stored as a pointer
        union {
           utensor::TensorInterface* tensor_interface_ptr;
           const Tensor* tensor_ptr;
        } _reference;

    public:
         Tensor(const Tensor& that) {
            type = TensorDataFieldType::TensorType;
            _reference.tensor_ptr = &that;
         } // Assume we have the assignment op as well

        utensor::TensorInterface* operator->() const {
            if(type == TensorDataFieldType::TensorType){
                return _reference.tensor_ptr->operator->();  // Returns a TensorInterface*
            } else{
                return _reference.tensor_interface_ptr; // Also a TensorInterface*
            }
        }
        Tensor(utensor::TensorInterface* ptr) : type(TensorDataFieldType::TensorInterfaceType) {
            _reference.tensor_interface_ptr = ptr;
            Context::DefaultTensorMetaDataAllocator::bind(this, ptr);
        }
        // Add some bits to make the interface nicer to the user

        // Force everything to be on the utensor allocator
        void* operator new(size_t sz) { // Have to delegate this size from tensors somehow + sizeof(Tensor)
            void* p = Context::DefaultTensorMetaDataAllocator::allocate(sz);
            return p;
        }
        void operator delete(void* p) {
            Context::DefaultTensorMetaDataAllocator::deallocate(p);
        }

        // KEY BIT
        friend class utensor::AllocatorInterface;
};

@neil-tan
Member Author

Summarizing an offline discussion I had with @mbartling:

  • Going ahead with Tensor-handle
  • Prefer Operator's inner structs for input/output tensors over TensorMap (a rough sketch of this direction follows below). However, TensorMap will still be in the proposal for future reference.
  • Try eliminating unnecessary function calls/indirections. We are trying to reduce latency.
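
A rough sketch of the agreed direction, with the caveat that the handle type here is only a placeholder for whatever the final Tensor handle becomes: per-operator inner input/output structs whose members are handles rather than raw Tensor*, with no TensorMap lookup in the hot path:

class TensorInterface;                 //underlying tensor implementation

class Tensor {                         //placeholder handle type, see the discussion above
 public:
  explicit Tensor(TensorInterface* t = nullptr) : t_(t) {}
  TensorInterface* operator->() { return t_; }
 private:
  TensorInterface* t_;
};

class Operator {
 public:
  virtual void compute() = 0;
  virtual ~Operator() {}
};

class AddOp : public Operator {
 public:
  struct { Tensor a; Tensor b; } inputs;   //inner structs, no TensorMap lookup
  struct { Tensor out; } outputs;
  void compute() override { /* direct kernel call, minimal indirection */ }
};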
