A lightweight, high-performance tensor engine built from scratch in C++.
This project implements core deep learning components, from autograd to advanced optimizers, without any external libraries.
- Topological Sorting: Implements a DFS-based topological sort to ensure the correct order of operations during backpropagation.
- Memory Management: Utilizes C++ `std::shared_ptr` to manage the lifecycle of temporary tensors and prevent memory leaks in complex computational graphs (see the sketch below).
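
The repository defines its own `Tensor` and `Function` types; purely as a rough sketch of the ownership structure the shared pointers imply (names such as `grad_fn`, `saved_tensors`, and `save_for_backward` follow the snippets below, the remaining members are assumptions), a graph node might look like this:

```cpp
#include <memory>
#include <vector>
using namespace std;

struct Function;  // an op node, defined alongside Tensor in the engine

struct Tensor {
    vector<int> shape, strides;
    vector<double> data, grad;
    int total_size = 0;
    bool requires_grad = false;
    // Owning pointer to the op that produced this tensor. Because the op in
    // turn owns its inputs via shared_ptr (saved_tensors), every temporary
    // created inside an expression stays alive until the backward pass is
    // done with it, and is freed automatically afterwards.
    shared_ptr<Function> grad_fn;
};

struct Function {
    // Inputs captured during forward(), kept alive for the backward pass.
    vector<shared_ptr<Tensor>> saved_tensors;
    void save_for_backward(shared_ptr<Tensor> t) { saved_tensors.push_back(t); }
};
```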
```cpp
// DFS-based topological sort of the computational graph: a tensor is appended
// only after the inputs it depends on, giving the correct order for backpropagation.
void build_topo(shared_ptr<Tensor> v, vector<shared_ptr<Tensor>>& topo_list, set<shared_ptr<Tensor>>& visited) {
    if (v == nullptr || visited.count(v) > 0) return;
    visited.insert(v);
    if (v->grad_fn) {
        // Recurse into the (at most two) inputs saved by the op that produced v.
        if (v->grad_fn->saved_tensors.size() > 0 && v->grad_fn->saved_tensors[0]->requires_grad) build_topo(v->grad_fn->saved_tensors[0], topo_list, visited);
        if (v->grad_fn->saved_tensors.size() > 1 && v->grad_fn->saved_tensors[1]->requires_grad) build_topo(v->grad_fn->saved_tensors[1], topo_list, visited);
    }
    topo_list.push_back(v);
}
```
- Cache-Friendly Design: Employs matrix transposition for the right-hand side (RHS) operand to achieve a higher cache hit rate during dot product operations.
```cpp
class MatrixMultiplication : public Function {
public:
    vector<int> res_strides;

    shared_ptr<Tensor> forward(vector<shared_ptr<Tensor>> inputs) {
        auto lhs = inputs[0];
        auto rhs_T = transpose_(inputs[1]); // key point: transpose the RHS so the inner loop reads it contiguously
        save_for_backward(lhs);
        save_for_backward(rhs_T);

        vector<int> res_shape = lhs->shape;
        res_shape[res_shape.size()-1] = rhs_T->shape[0];
        auto res = make_shared<Tensor>(res_shape, lhs->requires_grad || rhs_T->requires_grad);
        saved_attrs.push_back(res->total_size);
        res_strides = res->strides;
        res->fill(0.0); // zero-initialize the accumulator before summation

        for (int i = 0; i < res->total_size; i++) {
            int remaining = i;
            int lhs_idx = 0;
            int rhs_T_idx = 0;
            // Decompose the flat output index into per-dimension offsets for the LHS.
            for (int j = 0; j < res->shape.size()-1; j++) {
                lhs_idx += remaining / res->strides[j] * lhs->strides[j];
                remaining %= res->strides[j];
            }
            rhs_T_idx = remaining;
            // Both operands are now traversed sequentially along the contraction axis.
            for (int k = 0; k < lhs->shape[lhs->shape.size()-1]; k++) {
                res->data[i] += lhs->data[lhs_idx + k] * rhs_T->data[rhs_T_idx * rhs_T->strides[0] + k];
            }
        }
        return res;
    }
    ...
};
```
- Broadcasting: Seamlessly handles operations between tensors of different shapes (see the sketch after this list).
- Modular Layers: Built-in layers like `Linear` (FC), `LeakyReLU`, and `Softmax`.
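
The engine's broadcasting code isn't reproduced in this README; as a minimal sketch of the standard stride-zero technique such engines typically use (the `BroadcastPlan` type and `make_broadcast_plan` function are illustrative, not the project's API):

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>
using namespace std;

// NumPy-style broadcasting sketch: shapes are right-aligned, a dimension of
// size 1 stretches to match the other operand, and a stretched dimension is
// given stride 0 so its single stored element is reused across that axis.
struct BroadcastPlan {
    vector<int> shape;       // resulting broadcast shape
    vector<int> a_strides;   // strides to use when indexing operand a
    vector<int> b_strides;   // strides to use when indexing operand b
};

BroadcastPlan make_broadcast_plan(vector<int> a_shape, vector<int> a_strides,
                                  vector<int> b_shape, vector<int> b_strides) {
    size_t n = max(a_shape.size(), b_shape.size());
    // Left-pad the shorter shape with 1s (and stride 0) so both have rank n.
    while (a_shape.size() < n) { a_shape.insert(a_shape.begin(), 1); a_strides.insert(a_strides.begin(), 0); }
    while (b_shape.size() < n) { b_shape.insert(b_shape.begin(), 1); b_strides.insert(b_strides.begin(), 0); }

    BroadcastPlan plan;
    plan.shape.resize(n);
    plan.a_strides.resize(n);
    plan.b_strides.resize(n);
    for (size_t i = 0; i < n; i++) {
        if (a_shape[i] != b_shape[i] && a_shape[i] != 1 && b_shape[i] != 1)
            throw runtime_error("shapes are not broadcastable");
        plan.shape[i] = max(a_shape[i], b_shape[i]);
        plan.a_strides[i] = (a_shape[i] == 1) ? 0 : a_strides[i];
        plan.b_strides[i] = (b_shape[i] == 1) ? 0 : b_strides[i];
    }
    return plan;
}
```

An elementwise op can then walk the broadcast output shape once and index each operand with its own per-dimension strides, reusing the single element of any stride-0 dimension.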
- Problem: The model consistently predicted zero, and all weights were found to be zero after initialization.
- Root Cause: During He Initialization, the standard deviation calculation used integer division (`2/n_in`), resulting in a `sigma` of 0.
- Solution: Write the literal as floating-point (`2.0 / n_in`) so the division is performed in floating point and the weights follow the intended distribution.
```cpp
// He Initialization: draw weights from N(0, sqrt(2 / n_in)).
// `gen` is the project's random engine (e.g. a std::mt19937) defined elsewhere.
void he_initialization(shared_ptr<Tensor> params, int n_in) {
    long double mu = 0.0;
    long double sigma = sqrt(2.0/n_in); // 2.0 (not 2) keeps the division in floating point
    normal_distribution<double> dist(mu, sigma);
    for (int i = 0; i < params->total_size; i++) {
        params->data[i] = dist(gen);
    }
}
```
- Problem: Adding hidden layers caused the gradients to explode exponentially (NaN/Inf).
- Investigation: Debugging revealed that outputs from the first layer were being accumulated across batches. It appeared as if the memory was "leaking" previous values into current calculations.
- Solution: Identified that the `forward` pass of `MatrixMultiplication` lacked a zero-initialization step for the result tensor. Adding `res->fill(0.0)` ensured a clean slate for every operation, stabilizing the entire network.
```cpp
class MatrixMultiplication : public Function {
public:
    shared_ptr<Tensor> forward(vector<shared_ptr<Tensor>> inputs) {
        ...
        vector<int> res_shape = lhs->shape;
        res_shape[res_shape.size()-1] = rhs_T->shape[0];
        auto res = make_shared<Tensor>(res_shape, lhs->requires_grad || rhs_T->requires_grad);
        res->fill(0.0); // the fix: every output starts from a clean, zeroed buffer
        ...
    }
    ...
};
```

To achieve more stability in the custom engine, the following techniques were implemented (gradient clipping and noise injection are sketched after this list):
- Batch Accumulation: Simulates larger batch sizes to improve gradient estimation.
- Gradient Clipping: Prevents gradient explosion by capping the norm of gradients.
- Adam Optimization: Adaptive learning rates for faster and more stable convergence.
- Gaussian Noise Injection: Adds $\sigma=0.05$ noise to training data to improve generalization and prevent overfitting.
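
The optimizer and training-loop code are not reproduced here; below is a minimal sketch of gradient clipping by global norm and Gaussian noise injection, written against the `Tensor` fields used in this README (`data`, `total_size`; the `grad` buffer and the helper names are assumptions):

```cpp
#include <cmath>
#include <memory>
#include <random>
#include <vector>
using namespace std;

// Cap the global L2 norm of all parameter gradients at `max_norm`.
// If the accumulated norm exceeds the cap, every gradient is scaled down
// uniformly, which keeps a single bad batch from blowing up the update.
void clip_grad_norm(vector<shared_ptr<Tensor>>& params, double max_norm) {
    double total_sq = 0.0;
    for (auto& p : params)
        for (int i = 0; i < p->total_size; i++)
            total_sq += p->grad[i] * p->grad[i];
    double norm = sqrt(total_sq);
    if (norm > max_norm) {
        double scale = max_norm / norm;
        for (auto& p : params)
            for (int i = 0; i < p->total_size; i++)
                p->grad[i] *= scale;
    }
}

// Add zero-mean Gaussian noise (sigma = 0.05 in this project) to an input
// batch before the forward pass, as a cheap data-augmentation regularizer.
void add_gaussian_noise(shared_ptr<Tensor> batch, double sigma, mt19937& gen) {
    normal_distribution<double> dist(0.0, sigma);
    for (int i = 0; i < batch->total_size; i++)
        batch->data[i] += dist(gen);
}
```

Batch accumulation and Adam fit the same loop: gradients are summed over several mini-batches before a step is taken, and Adam keeps per-parameter running first and second moments to adapt the learning rate.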
```
----------------------------------------------------------------
Layer (type)             Output Shape             Param #
================================================================
FC1                      [32]                     25,120
LeakyReLU                [32]                     0
FC2                      [10]                     330
Softmax                  [10]                     0
================================================================
Total params: 25,450
```
- Training Accuracy: 92.14%
- Test Accuracy: 91.01%
- Generalization Gap: 1.13% (extremely stable thanks to the stabilization techniques above)