[TF 2.0 API Docs] tf.custom_gradient #26270

Open

tsbertalan opened this issue Mar 1, 2019 · 19 comments

@tsbertalan commented Mar 1, 2019


I've been having some difficulty using custom_gradient recently. Before implementing the op I actually have in mind, I'm trying to make a simple polynomial op with custom_gradient in an effort to get my head around the requirements. The documentation is not at all clear on how summing over batch and output indices should be done, and (until recently) wasn't clear on whether we should return the full Jacobian of the operation or just the vector-Jacobian-product ("VJP") function. (I did notice some explanation of the latter issue made it into the 1.12 -> 1.13 update, which was helpful.)

I think this problem applies both to the derivative WRT the op's input and to the derivative WRT the parameters (called grad_xs and grad_vars in the documentation). However, when I incorporated the VJP tidbit into my sample code, I got correct results for the former but not the latter, I believe because the implicit sum over the batch was already present in the supplied grad_ys, and because the VJP also collapses out the index across grad_ys.

So, the lack of clarity about sums only (accidentally) hit me when trying to write the grad_vars part. In my gist linked above, I had to use a reduce_sum when computing $dy/dp$, which, for a polynomial, is the corresponding powers of $x$. (Since $dy/dx$ doesn't depend on $x$, this issue didn't appear when writing the grad_xs part.)

In addition to simply saying what you mean in this documentation (the sum of the gradient over examples is not "the gradient"--only the per-example gradient truly deserves that name! But I suppose that particular fight is a lost cause), it would be good if there were at least some examples that exercised the grad_vars part and the variables=None keyword argument.
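For concreteness, here is a minimal TF 2-style sketch of the pattern I mean (the names and coefficients are made up for illustration, not taken from my gist): the grad function must return the per-example VJP for the input, but must reduce_sum the batch index away for the shared variable.

```python
import tensorflow as tf

p = tf.Variable([1.0, 2.0, 3.0])  # coefficients p_k of y = sum_k p_k * x**k

@tf.custom_gradient
def poly(x):
    # x has shape (batch,); y_i = sum_k p_k * x_i**k, also shape (batch,).
    powers = tf.stack([x ** k for k in range(3)], axis=1)  # (batch, 3)
    y = tf.linalg.matvec(powers, p)

    def grad(dy, variables=None):
        # dy has the shape of y: one upstream gradient value per example.
        dpowers = tf.stack(
            [tf.zeros_like(x), tf.ones_like(x), 2.0 * x], axis=1)  # d(x**k)/dx
        grad_x = dy * tf.linalg.matvec(dpowers, p)  # the batch index survives
        # dy_i/dp_k = x_i**k; here the batch index must be *summed away*,
        # because p is shared across examples -- this is the reduce_sum
        # the documentation never mentions.
        grad_p = tf.reduce_sum(powers * dy[:, None], axis=0)
        return grad_x, [grad_p]

    return y, grad

# Sanity check against ordinary autodiff of the same expression:
x = tf.constant([0.5, 2.0])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(poly(x))
print(tape.gradient(loss, p))  # d(sum_i y_i)/dp_k = sum_i x_i**k
```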

@tsbertalan tsbertalan changed the title Documentation for `custom_gradient` is still not correct. Documentation for `custom_gradient` is still not complete. Mar 2, 2019
@dynamicwebpaige dynamicwebpaige self-assigned this Mar 2, 2019
@dynamicwebpaige dynamicwebpaige changed the title Documentation for `custom_gradient` is still not complete. [TF 2.0 API Docs] tf.custom_gradient Mar 2, 2019
@dynamicwebpaige (Member) commented Mar 2, 2019

@tsbertalan - Thank you so much for mentioning this, and for creating a gist as an example. I agree that the API documentation for tf.custom_gradient could be improved, and this is a great start. 🙂

A few things:

  • Would you like to make a PR to update the documentation for tf.custom_gradient? If yes, I'd be happy to point you to where to get started!

  • Would you be interested in converting your gist example to TensorFlow 2.0 style (ping @alextp), and including it as an example in tensorflow/examples?

  • I noticed that you're a post-doc at MIT; congratulations! Would you be interested in working on similar efforts as a Google Summer of Code project? TensorFlow has recently been added as a mentoring organization, and updating the API docs with detailed technical examples would be a great project.

@Shashankjain12 commented Mar 4, 2019

Can I also work on updating the docs for tf.custom_gradient?

@alextp (Contributor) commented Mar 4, 2019

Thanks for the comments! I'd love to have improved documentation for tf.custom_gradient, and would love to review such a pull request.

One minor nit, though: when you say "the sum of the gradient over examples is not 'the gradient'--only the per-example gradient truly deserves that name," I do not quite agree. TensorFlow only differentiates scalar loss functions, so "the gradient" is shorthand for "the gradient of the sum / average of per-example losses," and because the gradient of a sum is the sum of the gradients, I think the text there is mostly correct.

That said, I agree with you that correct is not enough, and that we need to clarify these comments so others don't make the same mistake.
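As a toy illustration of the shorthand (names here are made up): differentiating the vector of per-example losses and differentiating their sum give the same answer, because the non-scalar target is implicitly summed.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])        # a "batch" of three examples
w = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    per_example_loss = (w * x) ** 2      # shape (3,), one loss per example
    total = tf.reduce_sum(per_example_loss)
# The non-scalar target is implicitly summed, so both calls agree:
print(tape.gradient(per_example_loss, w))  # 56.0
print(tape.gradient(total, w))             # 56.0
```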

@dsmic commented Aug 8, 2019

My original post is below; I am keeping this update on top only because it might help others trying to get custom_gradient to work with Keras in more complicated situations.

OK, this might be a documentation issue, but probably I was just not reading enough documentation. If you use custom_gradient, you are only allowed to use a limited set of Python statements, mainly TensorFlow operations. This gradient seems to work, at least; maybe it helps others:


```python
import tensorflow as tf
from tensorflow import keras  # the original script may import standalone keras instead

# `args.lstm_num` (the number of sub-networks) is defined elsewhere in the script.
@tf.custom_gradient
def select_subnet_layer(x):
    # The last `lstm_num` channels of x are the selection weights;
    # the channels before them are divided into `lstm_num` equal data blocks.
    x_select = x[:, :, -args.lstm_num:]
    x_data = x[:, :, :-args.lstm_num]
    print("x_select", x_select.shape)
    size_of_out = x_data.shape[2] // args.lstm_num
    # The output is the selection-weighted sum of the data blocks.
    out = 0 * x_data[:, :, :size_of_out]
    for i in range(args.lstm_num):
        out += x_data[:, :, (i * size_of_out):((i + 1) * size_of_out)] * x_select[:, :, i:i + 1]
    print("out", out.shape)
    print("x", x.shape)

    def custom_grad(dy):
        size_of_out = (x.shape[2] - args.lstm_num) // args.lstm_num
        gg = []  # gradient w.r.t. each data block
        gs = []  # gradient w.r.t. each selection channel
        for i in range(args.lstm_num):
            gg.append(x[:, :, size_of_out * args.lstm_num + i:size_of_out * args.lstm_num + i + 1] * dy)
            tmp = x[:, :, size_of_out * i:size_of_out * (i + 1)] * dy
            gs.append(keras.backend.sum(tmp, axis=2, keepdims=True))
        grad = keras.backend.concatenate(gg + gs)
        print(gg, gs, grad)
        return grad  # keras.backend.clip(grad, -1, 1)

    return out, custom_grad
```

I am not sure if I have a documentation-related issue, or maybe it is a bug.

I created a custom layer and I need to add a custom gradient.

The problem is: my layer is included in an LSTM network and therefore has changing input and output shapes. Initially the shape is e.g. (?, ?, 153), which becomes (1, variable size, 153) during calculation.

My Keras layer is defined as:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer  # the original script may use standalone keras

@tf.custom_gradient
def select_subnet_layer(x):
    global ssss
    # The last `lstm_num` channels of x are the selection weights;
    # the channels before them are divided into `lstm_num` equal data blocks.
    x_select = x[:, :, -args.lstm_num:]
    x_data = x[:, :, :-args.lstm_num]
    print("x_select", x_select.shape)
    size_of_out = x_data.shape[2] // args.lstm_num
    out = 0 * x_data[:, :, :size_of_out]
    for i in range(args.lstm_num):
        out += x_data[:, :, (i * size_of_out):((i + 1) * size_of_out)] * x_select[:, :, i:i + 1]
    print("out", out.shape)
    print("x", x.shape)

    def custom_grad(dy):
        print('debugging', dy)
        # Static shapes: these entries are None while the graph is being built.
        s1 = dy.shape.as_list()[0]
        s2 = dy.shape.as_list()[1]
        print(dy, [dy])
        if s1 is None:
            # Fallback with a hard-coded sequence length (324) -- the source
            # of the shape mismatch reported below.
            return tf.fill((1, 324, size_of_out * args.lstm_num + args.lstm_num), 1.0)
        grad_nump = np.ones([s1, s2, 153], dtype='float32')
        if x.shape.as_list()[0] is not None:
            for i in range(args.lstm_num):
                print('???', args.lstm_num)
                grad_nump[:, :, size_of_out * i:size_of_out * (i + 1)] = x[:, :, size_of_out * args.lstm_num + i]
                grad_nump[:, :, size_of_out * args.lstm_num + i] = np.sum(x[:, :, size_of_out * i:size_of_out * (i + 1)])
        grad = tf.convert_to_tensor(grad_nump)
        return grad

    return out, custom_grad

class CustomLayer(Layer):

    def __init__(self, **kwargs):
        super(CustomLayer, self).__init__(**kwargs)

    def call(self, x):
        return select_subnet_layer(x[:, :])  # you don't need to explicitly define the custom gradient

    def compute_output_shape(self, input_shape):
        print(input_shape[2])
        return (input_shape[0], input_shape[1], (int(input_shape[2]) - args.lstm_num) // args.lstm_num)
```

The `if s1 is None:` branch is not what I want. I tried to return a TF placeholder tensor with the correct dimensions, but that dimension is not changed to the correct one during evaluation of the net either.

The shape does not change later, so I get the error

```
InvalidArgumentError: shape of dy was [1,324,153] instead of [1,262,153]
	 [[{{node training_1/RMSprop/gradients/custom_layer_3/strided_slice_grad/StridedSliceGrad}}]]
	 [[loss_2/mul/_183]]
```

whenever one training example has a different dimension.

I have the full source of the file on github:
https://github.com/dsmic/LearnMultiplyByHand/blob/5e550ae6435c623c325889d8d5a28d65a12a092a/learnmultiply_schriftlich_limit_traindata_subnets.py

I would love to get any suggestion as to what the problem might be, and of course would help fix it in case it is a bug.
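(Edit: the core of what eventually worked for me above is to stay inside TensorFlow ops and use run-time shapes. A stripped-down sketch of the pattern, with a made-up op name: `tf.shape(dy)` is a tensor evaluated at run time, so it follows the actual variable sequence length, whereas `dy.shape.as_list()` gives `[1, None, 153]` and forces hard-coded sizes like the 324 above.)

```python
import tensorflow as tf

@tf.custom_gradient
def pass_through(x):
    def custom_grad(dy):
        # Build the gradient from the *dynamic* shape, not the static one,
        # so it works for every sequence length that shows up in training.
        return dy * tf.ones(tf.shape(dy))
    return tf.identity(x), custom_grad
```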

@lsgrep (Contributor) commented Oct 29, 2019

Is this still open? If so, I would like to take it on.

@dsmic commented Oct 29, 2019

For me it is closed. I learned to use only a limited set of Python statements, but I did not find any detailed documentation of what is allowed...

@bharatnishant commented Nov 9, 2019

@tsbertalan @dynamicwebpaige Has the issue been resolved? If not, I would like to improve it.

@tsbertalan (Author) commented Nov 13, 2019

@bharatnishant I think, actually, the documentation that needs this clarification first is that for the regular tf.gradients function. It does claim, from the very start, that it's for computing "symbolic derivatives of sum of ys w.r.t. x in xs." However, there appear to be cases in which this is not enforced, with confusing interactions between the presence of a batch dimension in either ys or xs.

I'd prefer more predictable behavior, with an exception raised rather than an automatic sum in the cases where a Jacobian-like computation is not possible; but, failing that, it seems reasonable at least to document when to expect that such an automatic sum will be added silently. As with convolution, we should be able to predict a priori the shape of the gradient tensor(s) for all the possible input shapes for x(s) and ys, including batch dimensions in either.
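For example (a toy graph-mode sketch of the silent sum I mean; nothing here is from the docs):

```python
import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # tf.gradients is a 1.x graph-mode API

x = tf.compat.v1.placeholder(tf.float32, shape=(None,))  # batched input
y = x ** 2                                               # non-scalar ys
# No exception: the result has the shape of x and equals d(sum_i y_i)/dx = 2x,
# not the (batch, batch) Jacobian dy_i/dx_j.
g = tf.gradients(y, x)[0]
```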

@alextp I get your point about how tf.gradients is only really intended for scalar functions. There would be no need for this to even be a shorthand in my ideal world where exceptions were raised when ys was not a scalar. However, in reality, there appear to be circumstances where the resulting tensor will provide unaggregated gradients $dL_i/d\theta_j$, where $i$ is the batch index, and $j$ is the index across this particular parameter vector.

I think that a distinction should be made (and, in some fashion, is made) between a batch index and a true index: computation for each batch index value is conceptually independent (except for some final aggregation), whereas true indices interact in everyday linear-algebraic computations. The point about tf.gradients being for scalars applies more to the second type of index than the first. A reasonable terminology would be to speak of unaggregated gradients (which are d/dtheta of a scalar, evaluated per entry in some dataset), and then change to the term "(un)aggregated Jacobians" when discussing derivatives of vectors per datum. Also note that, here, I think it makes sense to describe tensors with shapes of both () and (None,) as "scalar".
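To make "unaggregated" concrete (my sketch, not an official recipe): in 2.x, GradientTape.jacobian of the per-example loss vector with respect to a shared parameter gives exactly the $dL_i/d\theta_j$ array, and summing its batch axis recovers what tape.gradient would return.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])     # batch index i
theta = tf.Variable([0.5, -0.5])     # parameter index j
with tf.GradientTape(persistent=True) as tape:
    per_example_loss = tf.reduce_sum(x[:, None] * theta[None, :], axis=1)  # L_i
unagg = tape.jacobian(per_example_loss, theta)  # dL_i/dtheta_j, shape (3, 2)
agg = tf.reduce_sum(unagg, axis=0)              # == tape.gradient(per_example_loss, theta)
```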

For that matter, it would be nice to know what the real performance tradeoffs are for computing a Jacobian. Another issue discusses some methods for doing that, but do these scale, say, linearly in the output dimension M? That is, if you formulate this as the stacking of the gradients of M separate scalar-valued functions, does it in fact just take M times longer to compute? How does this change if you use the experimental tf.python.ops.parallel_for.gradients.jacobian? I believe it does raise warnings about space usage, and, in some circumstances, the "Jacobian" returned can have two batch dimensions i and j, and an enormous number of zero entries for all the cases where i != j.
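As a baseline for that timing question, by "stacking" I mean something like the following (a sketch in eager 2.x style; `f` and `m` are stand-ins):

```python
import tensorflow as tf

def jacobian_by_stacking(f, x, m):
    # One backward pass per output component: since f is re-evaluated each
    # time here, the cost grows (at least) linearly in the output dimension m.
    rows = []
    for i in range(m):
        with tf.GradientTape() as tape:
            tape.watch(x)
            y_i = f(x)[i]            # i-th scalar component of the output
        rows.append(tape.gradient(y_i, x))
    return tf.stack(rows)            # shape (m,) + x.shape
```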

As for clarification of the tf.custom_gradient documentation, I understand that, in deriving backpropagation via the chain rule, several vector-Jacobian products emerge that can be computed more efficiently as such, rather than as explicit full Jacobian evaluations followed by matrix-vector products. However, it is not totally clear to me how this squares with the terminology of "initial value gradients," "received gradients," or sometimes "upstream gradients" that is used to refer to the argument of the wrapped function. A more mathematical treatment would be very welcome, showing the complete chain rule for a simple expression, with proper equivalences drawn between the mathematical expressions and the code objects/arguments.
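Concretely, the equivalence I would like the docs to draw is (my notation): for $L = g(y)$ with $y = f(x)$, the chain rule gives $\partial L/\partial x = (\partial L/\partial y)\,(\partial y/\partial x)$; the dy argument handed to the function returned by custom_gradient plays the role of the upstream gradient $\partial L/\partial y$, and that function should return the VJP $(\partial L/\partial y)\,(\partial y/\partial x)$, never the Jacobian $\partial y/\partial x$ itself.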

I also see now that this is tagged as API v 2, though I'm still using the 1.x versions. I think that tf.gradients and tf.custom_gradient at least have similar documentation between the two versions, but in v2, there might be additional complications surrounding the "gradient tape" paradigm.

Sorry for the long-winded response. TensorFlow has been in a confusing state of flux over the past few versions, and so I (obviously) have many points of confusion on this.

@tanmoyio commented Nov 27, 2019

I think this issue is still open. If it is, I would like to work on it and improve the documentation.

@Saumil-Agarwal commented Jan 3, 2020

@tsbertalan @dynamicwebpaige @alextp Since the issue is still open, can I work on it and make the required changes to the documentation?

@alextp (Contributor) commented Jan 6, 2020

@ziofil commented Mar 9, 2020

bump

@hfahrudin commented Aug 7, 2020

Does this issue still occur?

@Harsh188 (Contributor) commented Aug 14, 2020

Hello, it's been quite some time since this issue was opened, and it looks like there aren't any active linked PRs. If this issue is still open, I will begin to work on it.
I'm not too familiar with the custom_gradient function. However, judging by the comments made by @tsbertalan and @alextp, the following changes have to be made to the documentation:

  • The document should clarify that under certain "cases an automatic sum will be added silently".
  • Clearing up the differences between "a batch index and a true index"

Feel free to correct me if I'm wrong, as my knowledge of gradients is fairly limited. Since this is also my first time working on an open-source issue, I would greatly appreciate it if anyone could point me towards the guidelines for PRs and documentation changes.

@AmitSharma1127 commented Oct 16, 2020

@Harsh188 I am open to this or any other work that needs improvement. I am just beginning my journey as an open-source contributor, and I would appreciate any pointers or useful tips to get started.

Please connect with me if you find a suitable time.

@adhilcodes commented Oct 23, 2020

I am a beginner and I would like to improve it; please help me to fix it.

@saad-sifar commented Oct 26, 2020

Same here; I am also a beginner and would like to contribute. Are there any places where I can help?

@suryagowda commented Oct 30, 2020

Hey there,
I am new to the open-source environment and I would like to contribute. Can anybody explain to me how to get started?

@ziofil commented Oct 30, 2020

I'm not sure if this is what you're looking for, but the first thing is to clone the repository (git clone [url of repo]). Then you make a new branch in which you develop a new feature or fix a bug. Finally, when the work is done, you open a pull request here on GitHub to have the new code reviewed and eventually merged.
