During training, a model's memory use breaks down into several components:

- Gradients: 4 bytes * number of parameters, for either fp32 or mixed precision training (gradients are always kept in fp32).
- Optimizer states: 8 bytes * number of parameters for a standard AdamW (it maintains 2 states per parameter), 2 bytes * number of parameters for 8-bit AdamW optimizers like bitsandbytes, or 4 bytes * number of parameters for optimizers like SGD with momentum (which maintains only 1 state).
- Forward activations: their size depends on many factors, the key ones being sequence length, hidden size, and batch size.

Beyond the forward activations saved for gradient computation, there are the inputs and outputs that are passed and returned by the forward and backward functions. Additionally, there are all kinds of temporary variables which get released once the calculation is done, but which in the moment can require additional memory and push you into OOM. Therefore, when coding it is crucial to think strategically about such temporary variables, and sometimes to explicitly free them as soon as they are no longer needed.

Your software can also have special memory needs of its own. For example, when generating text using beam search, it needs to maintain multiple copies of inputs and outputs.

As for speed: for convolutions and linear layers there are 2x the flops in the backward compared to the forward, which generally translates into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually bandwidth-limited, and it is typical for an activation to have to read more data in the backward than in the forward (e.g. the activation forward reads once and writes once, while the activation backward reads twice, gradOutput and the output of the forward, and writes once, gradInput).

The sketches below make these points concrete.
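First, the per-parameter accounting. Here is a minimal back-of-the-envelope sketch under my own naming (`training_memory_gib` is not a real library function, and the 4-byte fp32 weight copy is an assumption on top of the list above); activations are left out because they depend on sequence length, hidden size, and batch size:

```python
def training_memory_gib(num_params: int, optimizer: str = "adamw") -> float:
    """Rough training-memory estimate, excluding activations and buffers."""
    weight_bytes = 4  # assumption: an fp32 copy of the weights
    grad_bytes = 4    # gradients are always kept in fp32
    opt_bytes = {
        "adamw": 8,         # two fp32 states per parameter
        "adamw_8bit": 2,    # 8-bit AdamW, e.g. bitsandbytes
        "sgd_momentum": 4,  # a single fp32 state per parameter
    }[optimizer]
    return num_params * (weight_bytes + grad_bytes + opt_bytes) / 1024**3

# A 1B-parameter model: ~14.9 GiB with AdamW, ~9.3 GiB with 8-bit AdamW,
# before counting activations and temporary buffers.
print(training_memory_gib(1_000_000_000, "adamw"))
print(training_memory_gib(1_000_000_000, "adamw_8bit"))
```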
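On temporary variables, the fix is often just an early `del`. A sketch with made-up names; note that during training, autograd keeps anything saved for the backward alive regardless, so this mainly helps with true temporaries and no-grad code paths:

```python
import torch

@torch.no_grad()
def project(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    hidden = torch.relu(x @ w1)  # large temporary, only needed to compute 'out'
    out = hidden @ w2
    # Drop the reference as soon as it is dead so the caching allocator can
    # reuse that memory for the next allocation instead of holding both alive.
    del hidden
    return torch.softmax(out, dim=-1)
```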
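The beam-search point is visible in how generation loops are typically written: the batch is expanded by the number of beams, so every working tensor exists in `num_beams` copies. A sketch (shapes and names are illustrative, not any particular library's API):

```python
import torch

batch_size, seq_len, num_beams = 2, 16, 4
input_ids = torch.randint(0, 32000, (batch_size, seq_len))

# One copy of each sequence per beam: memory for inputs, outputs and any
# per-sequence caches scales by num_beams.
beam_ids = input_ids.repeat_interleave(num_beams, dim=0)
print(beam_ids.shape)  # torch.Size([8, 16])
```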
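Finally, the forward/backward asymmetry is easy to measure yourself. A minimal CUDA-event timing sketch (layer sizes and iteration counts are arbitrary); for a linear layer the backward should come out at roughly twice the forward:

```python
import torch

def time_ms(fn, iters=50, warmup=5):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

layer = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8192, 4096, device="cuda", requires_grad=True)
grad_out = torch.randn(8192, 4096, device="cuda")

fwd = time_ms(lambda: layer(x))
# Time forward+backward together, then subtract the forward-only time.
fwd_bwd = time_ms(lambda: layer(x).backward(grad_out))
print(f"forward: {fwd:.2f} ms, backward: {fwd_bwd - fwd:.2f} ms")
```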