context refactoring #135

Open · Knight-X opened this issue Nov 4, 2018 · 8 comments

Knight-X (Member) commented Nov 4, 2018

  1. Memory arrangement: a customized new/delete function. Essentially, pre-allocate a block of memory during context initialization. This memory will be used for caching ops and for tensor memory management, in an attempt to reduce the overhead associated with the new/delete calls of the current approach. Moreover, we can lay out memory precisely because the memory access pattern is known.

  2. Rewrite the context class (a sketch of the intended push path follows this list):

     - Instead of allocating memory for an op every time, just record the name of the op and allocate the memory once, or lazily.

     - Use op_table to register the op classes.

     - Remove rtable.

     - In the push function, check whether the op is already allocated; if not, allocate it.
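
For item 2, a minimal sketch of the lazy push path, assuming placeholder names (Operator, op_table, register_op, and Context::push here are illustrations, not the existing uTensor classes):

#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Sketch only: illustrates "record the op name, allocate once or lazily".
class Operator {
 public:
  virtual void eval() = 0;
  virtual ~Operator() {}
};

class Context {
 public:
  // Register a constructor for each op type once (this plays the role of op_table
  // and replaces rtable).
  static void register_op(const std::string& name, std::function<Operator*()> ctor) {
    op_table()[name] = ctor;
  }

  // push() only constructs the op the first time it is seen; later pushes reuse it.
  Operator* push(const std::string& name) {
    auto cached = op_cache.find(name);
    if (cached != op_cache.end()) return cached->second.get();  // already allocated
    auto ctor = op_table().find(name);
    if (ctor == op_table().end()) return nullptr;               // unknown op name
    Operator* op = ctor->second();                              // lazy, first use only
    op_cache[name] = std::unique_ptr<Operator>(op);
    return op;
  }

 private:
  static std::unordered_map<std::string, std::function<Operator*()>>& op_table() {
    static std::unordered_map<std::string, std::function<Operator*()>> table;
    return table;
  }
  std::unordered_map<std::string, std::unique_ptr<Operator>> op_cache;
};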

Knight-X (Member, Author) commented Nov 22, 2018

units: seconds

op             push      eval
MatMul         0.000991  0.133074
QuantizedAdd   0.001004  0.001039
QuantizeV2     0.000695  0.000611
Requantize     0.000884  0.000851

Except for the MatMul op, the cost of push is about the same as the cost of eval.
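
For reference, a minimal sketch of how timings like these could be collected, with ctx, MatMulOp, push, and eval standing in for the real uTensor calls rather than exact signatures:

#include <chrono>
#include <cstdio>

// Times a single callable and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_sec(F&& f) {
  auto t0 = std::chrono::steady_clock::now();
  f();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

// Usage (ctx.push/ctx.eval are placeholders for the actual calls):
//   double push_s = time_sec([&] { ctx.push(new MatMulOp(), inputs, outputs); });
//   double eval_s = time_sec([&] { ctx.eval(); });
//   printf("push: %f\neval: %f\n", push_s, eval_s);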

Knight-X (Member, Author) commented Feb 8, 2019

class Allocator {
 public:
  Allocator();

  template<typename T>
  S_TENSOR allocate(TName _name, std::vector<uint32_t> dim, const T* inline_const, std::string type);

  template<typename T>
  void release(TName _name);

  S_TENSOR getPtr(TName _name);

  template<typename T>
  void yield(TName _name);

  template<typename T>
  void occupy(TName _name);

 protected:
  std::unordered_map<TName, S_TENSOR> a;  // store the whole tensor pointer
  void* data;                             // whole memory buffer for storing all tensor data
};

Allocator::Allocator() {
  // prepare the whole memory buffer up front; TOTAL_BUFFER_SIZE is a placeholder for the arena size
  data = malloc(TOTAL_BUFFER_SIZE);
}

template<typename T>
S_TENSOR Allocator::allocate(TName _name, std::vector<uint32_t> dim, const T* inline_const, std::string type) {
  Tensor* t = nullptr;
  if (a.find(_name) != a.end()) {
    return a[_name];
  }
  if (type == "BinaryTensor")
    t = new BinaryTensor<T>(dim, inline_const);
  S_TENSOR _sptr(t);
  a[_name] = _sptr;
  return _sptr;
}

template<typename T>
void Allocator::release(TName _name) {
  if (a.find(_name) != a.end()) {
    a.erase(_name);
  }
}

template<typename T>
void Allocator::yield(TName _name) {
  if (a.find(_name) != a.end()) {
    // should access the shared data
    a[_name]->s->data = nullptr;
  }
}

template<typename T>
void Allocator::occupy(TName _name) {
  if (a.find(_name) != a.end()) {
    a[_name]->s->data = nullptr;  // placeholder: should be set to the position of the data block
  }
}

S_TENSOR Allocator::getPtr(TName _name) {
  return a[_name];
}
This is just an allocator draft for now; I am still working on it:

  1. Pre-allocate the memory buffer and use occupy and yield to transfer ownership of a memory block.
  2. Do not re-malloc the whole tensor on each context add.

Still to do:

  1. We need an interface to access the raw data pointer for memory ownership transfer (one possible shape for it is sketched below).
  2. Should we replace S_TENSOR with Tensor* ?
    @neil-tan @mbartling
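
One possible shape for that raw-pointer interface, with hypothetical method names (raw/set_raw/release_raw are not current uTensor Tensor methods, just an illustration):

// Hypothetical sketch only: one way a tensor's data slot could be handed
// back and forth between the tensor and the allocator.
class TensorDataHandle {
 public:
  void* raw() const { return data_; }    // allocator reads the current slot
  void set_raw(void* p) { data_ = p; }   // occupy: bind to a slot in the buffer
  void* release_raw() {                  // yield: detach and hand the slot back
    void* p = data_;
    data_ = nullptr;
    return p;
  }
 private:
  void* data_ = nullptr;
};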

neil-tan (Member) commented Feb 8, 2019

  • <T> should eventually be #define T_name 123 and is generated by cgen @mbartling @dboyliao
  • std::unordered_map: consider a linearly sorted data structure instead. There will be many, many tensor look-ups during one inference cycle, so this needs to be as efficient as possible (a sketch of one option follows this list).
  • Try to do away with the string class; this reduces the overall binary size.
  • t = new BinaryTensor(dim, inline_const); it's fine to use heap allocation here, but we want to include a caching algorithm for performance reasons.
  • Allocator::occupy(TName _name) seems to be the same as void Allocator::yield(TName _name). Are they placeholders for now?
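
To make the linearly-sorted-lookup suggestion concrete, a minimal sketch, assuming TName is an integral id generated by cgen; SortedTable is a made-up name, not an existing uTensor type:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using TName = uint32_t;

// A sorted vector of (TName, value) pairs searched with binary search:
// O(log n) look-ups, contiguous storage, and no hashing or node allocations.
template <typename V>
class SortedTable {
 public:
  void insert(TName key, V value) {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), key, cmp);
    entries_.insert(it, {key, std::move(value)});
  }
  V* find(TName key) {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), key, cmp);
    return (it != entries_.end() && it->first == key) ? &it->second : nullptr;
  }
 private:
  static bool cmp(const std::pair<TName, V>& e, TName key) { return e.first < key; }
  std::vector<std::pair<TName, V>> entries_;  // kept sorted by key
};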

Some thoughts:

  • Please leave a clear path for supporting the cgen-memory-management scheme.
  • Regarding memory ownership transfer, please see if PR142 helps.
  • There is still the question of using Tensor* versus supporting move semantics with S_TENSOR.

@Knight-X

Knight-X (Member, Author) commented Feb 13, 2019

  • t = new BinaryTensor(dim, inline_const); — that is why I tried to propose the class-serialization method. However, we would have to modify the tensor class because of its virtual methods.
  • The occupy method is meant to acquire memory, whereas yield releases it.

S_TENSOR is heavier than Tensor* (see the size comparison below). In order to reduce memory usage, should we consider deprecating it?
As mentioned before, the purpose of the allocator is currently the activation-data memory. However, I would like to make it easy to extend for tensor data reuse.
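
On the S_TENSOR-versus-Tensor* point, a quick way to see the per-handle overhead, assuming S_TENSOR is a std::shared_ptr<Tensor> alias:

#include <cstdio>
#include <memory>

struct Tensor;  // forward declaration is enough for pointer sizes

int main() {
  // A shared_ptr carries a control-block pointer in addition to the object
  // pointer, so it is typically twice the size of a raw pointer, plus the
  // separately allocated control block itself.
  printf("sizeof(Tensor*)                 = %zu\n", sizeof(Tensor*));
  printf("sizeof(std::shared_ptr<Tensor>) = %zu\n", sizeof(std::shared_ptr<Tensor>));
  return 0;
}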

Knight-X (Member, Author) commented Feb 14, 2019

class Allocator {
 public:
  Allocator(int a);

  template<typename T>
  void allocate(TName _name, std::vector<uint32_t> dim, const T* inline_const, std::string type);

  template<typename T>
  void release(TName _name);

  S_TENSOR getPtr(TName _name);

  void yield(TName _name);

  void occupy(TName _name, const std::vector<uint32_t> &v, uint32_t offset);

 protected:
  std::unordered_map<TName, S_TENSOR> tensor_pool;  // store the whole tensor pointer
  uint8_t* data;                                    // whole memory buffer for storing all tensor data
};

Allocator::Allocator(int a) {
  data = (uint8_t*) malloc(sizeof(uint8_t) * a);  // prepare the whole memory buffer up front
}

template<typename T>
void Allocator::allocate(TName _name, std::vector<uint32_t> dim, const T* inline_const, std::string type) {
  Tensor* t = nullptr;
  if (tensor_pool.find(_name) != tensor_pool.end()) {
    return;
  }
  if (type == "BinaryTensor") {
    t = new BinaryTensor<T>(dim, inline_const);
  } else if (type == "RamTensor") {
    t = new RamTensor<T>();
  }
  S_TENSOR _sptr(t);
  tensor_pool[_name] = _sptr;
}

template<typename T>
void Allocator::release(TName _name) {
  if (tensor_pool.find(_name) != tensor_pool.end()) {
    tensor_pool.erase(_name);
  }
}

void Allocator::yield(TName _name) {
  if (tensor_pool.find(_name) != tensor_pool.end()) {
    S_TENSOR tmp = tensor_pool[_name];
    // Note: this only rebinds the local copy of the pointer; actually detaching
    // the tensor from the buffer needs a setter on Tensor (the raw-pointer
    // interface mentioned above).
    void* raw_ptr = tmp->getRawPtr();
    raw_ptr = nullptr;
  }
}

void Allocator::occupy(TName _name, const std::vector<uint32_t>& v, uint32_t offset) {
  S_TENSOR tmp = tensor_pool[_name];
  uint32_t size = tmp->getSize();
  // Note: as in yield(), assigning to the local copy g does not update the
  // tensor; the intent is to point the tensor's data at (buffer base + offset)
  // and advance past its size.
  void* g = tmp->getRawPtr();
  data = data + offset;
  g = reinterpret_cast<void*>(data);
  data = data + size;
}

S_TENSOR Allocator::getPtr(TName _name) {
  return tensor_pool[_name];
}
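
To illustrate how the codegen-computed offsets would drive this draft, a hypothetical call sequence (ids, shapes, and offsets are made up, and TName is assumed to be an integral id):

// Illustration only: in practice the offsets would be emitted by the code
// generator from the known memory access pattern.
void example_plan() {
  Allocator alloc(4096);                        // one 4 KB arena for all activations
  alloc.allocate<float>(/*_name=*/1, {1, 128}, nullptr, "RamTensor");
  alloc.allocate<float>(/*_name=*/2, {1, 128}, nullptr, "RamTensor");
  alloc.occupy(1, {1, 128}, /*offset=*/0);      // bind tensor 1's data into the arena
  alloc.occupy(2, {1, 128}, /*offset=*/512);    // bind tensor 2 at the codegen-chosen offset
  // ... run the ops that read/write these tensors ...
  alloc.yield(1);                               // lifetime over: region becomes reusable
  alloc.yield(2);
}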

mbartling (Member) commented

@Knight-X, can you explain your thought process for the Allocator? I am having a hard time wrapping my head around it.

mbartling (Member) commented

I always imagined the MemAllocators as separate entities. Also, things get really convenient if we define our own TensorId type instead of just using strings:

class DefaultUtensorAllocator;

class TensorId {
    /** Some hashable tensor lookup metadata **/
    public:
    ...
    void*& where() { return _loc; }

    // Rest of interface
    uint32_t operator()() { return hash(); }  // hashable: forwards to hash()
    ...

    protected:
    virtual uint32_t hash() = 0;

    private:
    void* _loc;
};

// may not need this but makes life convenient
template <typename T>
class uBlock {
    public:
    uBlock(TensorShape shape) {
        mini_buffer.reserve(shape.linear_space);
    }

    // This class just exposes a read/write interface to cached blocks of a Tensor
    private:
    std::vector<T> mini_buffer;
};

// Add additional checks so this never gets created on the stack
template <typename Allocator=DefaultUtensorAllocator>
class Tensor : public uTensor {
    ...
    TensorId* tid;
    public:
    /**
        Use Tensor construction to register with the context class, as this information is relatively small.
        There are only a handful of tensors in a graph.
    */
    Tensor(TensorId* tid) : tid(tid) {
        Context::register_tensor(tid, this); // This can be reverse-looked-up if necessary (returns a TensorId)
    }

    // Override new and delete so we can control where all tensors are allocated
    void* operator new(size_t size) {
        void* p = Allocator::allocate(size);
        return p;
    }
    void operator delete(void* ptr) {
        Allocator::deallocate(ptr);
    }

    protected:
    void associate_data(void* data){ this->data = data; }
};

// Easy peasy
class RomTensorId : public TensorId {
    public:
    RomTensorId(void* data) { where() = data; }
    uint32_t hash() { return (uint32_t) where(); }
};

class RomTensor : public Tensor<> {
    RomTensor() {} // TensorId must be specified for RomTensors
    public:
    // assumes a Tensor(TensorId*, TensorShape) constructor elided in the ... above
    RomTensor(RomTensorId id, TensorShape shape) : Tensor<>(&id, shape) {
        // Rom is easy since the TensorId _loc points to an address directly
        associate_data(id.where());
    }
    ...
};

class RamTensorId : public TensorId {
    int id;
    public:
    RamTensorId(int id) : id(id) { where() = &this->id; } // point at the stored member, not the parameter
    uint32_t hash() { return (uint32_t) id; }
};

// Essentially a Rom tensor except it has the RAM allocation bits
class RamTensor : public Tensor<> {
    private:

    public:
    RamTensor(RamTensorId id, TensorShape shape) : Tensor<>(&id, shape) {
        // uTensorRamAllocator returns a pointer to the data field associated with the tensor info
        associate_data(uTensorRamAllocator::allocate(id.hash(), shape.linear_space));
    }
    ~RamTensor() {
        uTensorRamAllocator::deallocate(tid->hash());
    }
};

/**
    Really this class doesn't even need to be this smart
*/
class DefaultUtensorAllocator {
    public:
        static void* allocate(size_t size) { // be dumb for now, but really should allocate based on the requested size
            void* p = mem_cache.insert();
            return p;
        }
        static void deallocate(void* key){
            mem_cache.remove(key);
        }
    private:

    /** modified hash returns key on insert */
    static FixedEntryHeap<sizeof(Tensor<>), NUM_HEAP_ENTRIES> mem_cache; // Lookup Tensor*
};

Knight-X (Member, Author) commented

allocate.allocate("tensor_a", ...);
allocate.occupy("tensor_a", offset_value);
S_TENSOR s = allocate.getPtr("tensor_a");
allocate.yield("tensor_a");

The allocator maintains a pre-allocated memory buffer from which tensor data is allocated first, and it leaves space for storing the tensor objects themselves as a later optimization. During the codegen process, the memory offset of each tensor is calculated and handed to the allocator so that buffer usage is as efficient as possible. Once a tensor's lifetime ends, the allocator frees its region of the data buffer for other tensors. For now we only cache the tensor objects themselves, but they could also be allocated from the memory buffer for further optimization.
