[TF 2.0 API Docs] tf.custom_gradient #26270

Open

tsbertalan opened this issue Mar 1, 2019 · 19 comments

@tsbertalan commented Mar 1, 2019


I've been having some difficulty using custom_gradient recently. Before implementing the op I actually have in mind, I'm trying to make a simple polynomial op with custom_gradient in an effort to get my head around the requirements. The documentation is not at all clear on how summing over batch and output indices should be done, and (until recently) wasn't clear on whether we should return the full Jacobian of the operation or just the vector-Jacobian-product ("VJP") function. (I did notice some explanation of the latter issue made it into the 1.12 -> 1.13 update, which was helpful.)

I think this problem applies both to the derivative WRT the op's input and to the derivative WRT the parameters (called grad_xs and grad_vars in the documentation). However, when I incorporated the VJP tidbit into my sample code, I got correct results for the former but not the latter, I believe because the implicit sum over the batch was already present in the supplied grad_ys, and because the VJP also collapses out the index across grad_ys.

So, the lack of clarity about sums only (accidentally) hit me when trying to write the grad_vars part. In my gist linked above, I had to use a reduce_sum when computing $dy/dp$, which, for a polynomial, is the corresponding powers of $x$. (Since $dy/dx$ doesn't depend on $x$, this issue didn't appear when writing the grad_xs part.)

In addition to simply saying what you mean in this documentation (the sum of the gradient over examples is not "the gradient"--only the per-example gradient truly deserves that name! But I suppose that particular fight is a lost cause), it would be good if there were at least some examples that exercised the grad_vars part and the variables=None keyword argument.
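For concreteness, here is a minimal TF 2-style sketch of the pattern I mean (the names and coefficients are made up for illustration, not taken from my gist): the grad function must return the per-example VJP for the input, but must reduce_sum the batch index away for the shared variable.

```python
import tensorflow as tf

p = tf.Variable([1.0, 2.0, 3.0])  # coefficients p_k of y = sum_k p_k * x**k

@tf.custom_gradient
def poly(x):
    # x has shape (batch,); y_i = sum_k p_k * x_i**k, also shape (batch,).
    powers = tf.stack([x ** k for k in range(3)], axis=1)  # (batch, 3)
    y = tf.linalg.matvec(powers, p)

    def grad(dy, variables=None):
        # dy has the shape of y: one upstream gradient value per example.
        dpowers = tf.stack(
            [tf.zeros_like(x), tf.ones_like(x), 2.0 * x], axis=1)  # d(x**k)/dx
        grad_x = dy * tf.linalg.matvec(dpowers, p)  # the batch index survives
        # dy_i/dp_k = x_i**k; here the batch index must be *summed away*,
        # because p is shared across examples -- this is the reduce_sum
        # the documentation never mentions.
        grad_p = tf.reduce_sum(powers * dy[:, None], axis=0)
        return grad_x, [grad_p]

    return y, grad

# Sanity check against ordinary autodiff of the same expression:
x = tf.constant([0.5, 2.0])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(poly(x))
print(tape.gradient(loss, p))  # d(sum_i y_i)/dp_k = sum_i x_i**k
```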

@tsbertalan tsbertalan changed the title Documentation for `custom_gradient` is still not correct. Documentation for `custom_gradient` is still not complete. Mar 2, 2019
@dynamicwebpaige dynamicwebpaige self-assigned this Mar 2, 2019
@dynamicwebpaige dynamicwebpaige changed the title Documentation for `custom_gradient` is still not complete. [TF 2.0 API Docs] tf.custom_gradient Mar 2, 2019
@dynamicwebpaige (Member) commented Mar 2, 2019

@tsbertalan - Thank you so much for mentioning this, and for creating a gist as an example. I agree that the API documentation for tf.custom_gradient could be improved, and this is a great start. 🙂

A few things:

  • Would you like to make a PR to update the documentation for tf.custom_gradient? If yes, I'd be happy to point you to where to get started!

  • Would you be interested in converting your gist example to TensorFlow 2.0 style (ping @alextp), and including it as an example in tensorflow/examples?

  • I noticed that you're a post-doc at MIT; congratulations! Would you be interested in working on similar efforts as a Google Summer of Code project? TensorFlow has recently been added as a mentoring organization, and updating the API docs with detailed technical examples would be a great project.

@Shashankjain12 commented Mar 4, 2019

Can I also work on updating the docs for tf.custom_gradient?

@alextp (Contributor) commented Mar 4, 2019

Thanks for the comments! I'd love to have improved documentation for tf.custom_gradient, and would love to review such a pull request.

One minor nit, though: when you say "the sum of the gradient over examples is not 'the gradient'--only the per-example gradient truly deserves that name," I do not quite agree. TensorFlow only differentiates scalar loss functions, so "the gradient" is shorthand for "the gradient of the sum / average of per-example losses," and because the gradient of a sum is the sum of the gradients, I think the text there is mostly correct.

That said, I agree with you that correct is not enough, and that we need to clarify these comments so others don't make the same mistake.
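As a toy illustration of the shorthand (names here are made up): differentiating the vector of per-example losses and differentiating their sum give the same answer, because the non-scalar target is implicitly summed.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])        # a "batch" of three examples
w = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    per_example_loss = (w * x) ** 2      # shape (3,), one loss per example
    total = tf.reduce_sum(per_example_loss)
# The non-scalar target is implicitly summed, so both calls agree:
print(tape.gradient(per_example_loss, w))  # 56.0
print(tape.gradient(total, w))             # 56.0
```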

@dsmic commented Aug 8, 2019

My original post is below; I am keeping this update on top only because it might help others trying to get custom_gradient to work with Keras in more complicated situations.

OK, this might be a documentation issue, but probably I was just not reading enough documentation. If you use custom_gradient, you are only allowed to use a limited set of Python statements, mainly TensorFlow operations. This gradient seems to work, at least; maybe it helps others:


```python
import tensorflow as tf
from tensorflow import keras  # the original script may import standalone keras instead

# `args.lstm_num` (the number of sub-networks) is defined elsewhere in the script.
@tf.custom_gradient
def select_subnet_layer(x):
    # The last `lstm_num` channels of x are the selection weights;
    # the channels before them are divided into `lstm_num` equal data blocks.
    x_select = x[:, :, -args.lstm_num:]
    x_data = x[:, :, :-args.lstm_num]
    print("x_select", x_select.shape)
    size_of_out = x_data.shape[2] // args.lstm_num
    # The output is the selection-weighted sum of the data blocks.
    out = 0 * x_data[:, :, :size_of_out]
    for i in range(args.lstm_num):
        out += x_data[:, :, (i * size_of_out):((i + 1) * size_of_out)] * x_select[:, :, i:i + 1]
    print("out", out.shape)
    print("x", x.shape)

    def custom_grad(dy):
        size_of_out = (x.shape[2] - args.lstm_num) // args.lstm_num
        gg = []  # gradient w.r.t. each data block
        gs = []  # gradient w.r.t. each selection channel
        for i in range(args.lstm_num):
            gg.append(x[:, :, size_of_out * args.lstm_num + i:size_of_out * args.lstm_num + i + 1] * dy)
            tmp = x[:, :, size_of_out * i:size_of_out * (i + 1)] * dy
            gs.append(keras.backend.sum(tmp, axis=2, keepdims=True))
        grad = keras.backend.concatenate(gg + gs)
        print(gg, gs, grad)
        return grad  # keras.backend.clip(grad, -1, 1)

    return out, custom_grad
```

I am not sure if I have a documentation-related issue, or maybe it is a bug.

I created a custom layer and I need to add a custom gradient.

The problem is: my layer is included in an LSTM network and therefore has changing input and output shapes. Initially the shape is e.g. (?, ?, 153), which becomes (1, variable size, 153) during calculation.

My Keras layer is defined as:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer  # the original script may use standalone keras

@tf.custom_gradient
def select_subnet_layer(x):
    global ssss
    # The last `lstm_num` channels of x are the selection weights;
    # the channels before them are divided into `lstm_num` equal data blocks.
    x_select = x[:, :, -args.lstm_num:]
    x_data = x[:, :, :-args.lstm_num]
    print("x_select", x_select.shape)
    size_of_out = x_data.shape[2] // args.lstm_num
    out = 0 * x_data[:, :, :size_of_out]
    for i in range(args.lstm_num):
        out += x_data[:, :, (i * size_of_out):((i + 1) * size_of_out)] * x_select[:, :, i:i + 1]
    print("out", out.shape)
    print("x", x.shape)

    def custom_grad(dy):
        print('debugging', dy)
        # Static shapes: these entries are None while the graph is being built.
        s1 = dy.shape.as_list()[0]
        s2 = dy.shape.as_list()[1]
        print(dy, [dy])
        if s1 is None:
            # Fallback with a hard-coded sequence length (324) -- the source
            # of the shape mismatch reported below.
            return tf.fill((1, 324, size_of_out * args.lstm_num + args.lstm_num), 1.0)
        grad_nump = np.ones([s1, s2, 153], dtype='float32')
        if x.shape.as_list()[0] is not None:
            for i in range(args.lstm_num):
                print('???', args.lstm_num)
                grad_nump[:, :, size_of_out * i:size_of_out * (i + 1)] = x[:, :, size_of_out * args.lstm_num + i]
                grad_nump[:, :, size_of_out * args.lstm_num + i] = np.sum(x[:, :, size_of_out * i:size_of_out * (i + 1)])
        grad = tf.convert_to_tensor(grad_nump)
        return grad

    return out, custom_grad

class CustomLayer(Layer):

    def __init__(self, **kwargs):
        super(CustomLayer, self).__init__(**kwargs)

    def call(self, x):
        return select_subnet_layer(x[:, :])  # you don't need to explicitly define the custom gradient

    def compute_output_shape(self, input_shape):
        print(input_shape[2])
        return (input_shape[0], input_shape[1], (int(input_shape[2]) - args.lstm_num) // args.lstm_num)
```

The `if s1 is None:` branch is not what I want. I tried to return a TF placeholder tensor with the correct dimensions, but that dimension is not changed to the correct one during evaluation of the net either.

The shape does not change later, so I get the error

```
InvalidArgumentError: shape of dy was [1,324,153] instead of [1,262,153]
	 [[{{node training_1/RMSprop/gradients/custom_layer_3/strided_slice_grad/StridedSliceGrad}}]]
	 [[loss_2/mul/_183]]
```

whenever one training example has a different dimension.

I have the full source of the file on github:
https://github.com/dsmic/LearnMultiplyByHand/blob/5e550ae6435c623c325889d8d5a28d65a12a092a/learnmultiply_schriftlich_limit_traindata_subnets.py

I would love to get any suggestion as to what the problem might be, and of course would help fix it in case it is a bug.
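(Edit: the core of what eventually worked for me above is to stay inside TensorFlow ops and use run-time shapes. A stripped-down sketch of the pattern, with a made-up op name: `tf.shape(dy)` is a tensor evaluated at run time, so it follows the actual variable sequence length, whereas `dy.shape.as_list()` gives `[1, None, 153]` and forces hard-coded sizes like the 324 above.)

```python
import tensorflow as tf

@tf.custom_gradient
def pass_through(x):
    def custom_grad(dy):
        # Build the gradient from the *dynamic* shape, not the static one,
        # so it works for every sequence length that shows up in training.
        return dy * tf.ones(tf.shape(dy))
    return tf.identity(x), custom_grad
```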

@lsgrep (Contributor) commented Oct 29, 2019

Is this still open? If so, I would like to take it on.

@dsmic commented Oct 29, 2019

For me it is closed. I learned to use only a limited set of Python statements, but I did not find any detailed documentation of what is allowed...

@bharatnishant commented Nov 9, 2019

@tsbertalan @dynamicwebpaige Has the issue been resolved? If not, I would like to improve it.

@tsbertalan (Author) commented Nov 13, 2019

@bharatnishant I think, actually, the documentation that needs this clarification first is that for the regular tf.gradients function. It does claim, from the very start, that it's for computing "symbolic derivatives of sum of ys w.r.t. x in xs." However, there appear to be cases in which this is not enforced, with confusing interactions between the presence of a batch dimension in either ys or xs.

I'd prefer more predictable behavior, with an exception raised rather than an automatic sum in the cases where a Jacobian-like computation is not possible; but, failing that, it seems reasonable at least to document when to expect that such an automatic sum will be added silently. As with convolution, we should be able to predict a priori the shape of the gradient tensor(s) for all the possible input shapes for x(s) and ys, including batch dimensions in either.
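For example (a toy graph-mode sketch of the silent sum I mean; nothing here is from the docs):

```python
import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # tf.gradients is a 1.x graph-mode API

x = tf.compat.v1.placeholder(tf.float32, shape=(None,))  # batched input
y = x ** 2                                               # non-scalar ys
# No exception: the result has the shape of x and equals d(sum_i y_i)/dx = 2x,
# not the (batch, batch) Jacobian dy_i/dx_j.
g = tf.gradients(y, x)[0]
```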

@alextp I get your point about how tf.gradients is only really intended for scalar functions. There would be no need for this to even be a shorthand in my ideal world where exceptions were raised when ys was not a scalar. However, in reality, there appear to be circumstances where the resulting tensor will provide unaggregated gradients $dL_i/d\theta_j$, where $i$ is the batch index, and $j$ is the index across this particular parameter vector.

I think that a distinction should be made (and, in some fashion, is made) between a batch index and a true index: computation for each batch index value is conceptually independent (except for some final aggregation), whereas true indices interact in everyday linear-algebraic computations. The point about tf.gradients being for scalars applies more to the second type of index than the first. A reasonable terminology would be to speak of unaggregated gradients (which are d/dtheta of a scalar, evaluated per entry in some dataset), and then change to the term "(un)aggregated Jacobians" when discussing derivatives of vectors per datum. Also note that, here, I think it makes sense to describe tensors with shapes of both () and (None,) as "scalar".
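To make "unaggregated" concrete (my sketch, not an official recipe): in 2.x, GradientTape.jacobian of the per-example loss vector with respect to a shared parameter gives exactly the $dL_i/d\theta_j$ array, and summing its batch axis recovers what tape.gradient would return.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])     # batch index i
theta = tf.Variable([0.5, -0.5])     # parameter index j
with tf.GradientTape(persistent=True) as tape:
    per_example_loss = tf.reduce_sum(x[:, None] * theta[None, :], axis=1)  # L_i
unagg = tape.jacobian(per_example_loss, theta)  # dL_i/dtheta_j, shape (3, 2)
agg = tf.reduce_sum(unagg, axis=0)              # == tape.gradient(per_example_loss, theta)
```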

For that matter, it would be nice to know what the real performance tradeoffs are for computing a Jacobian. Another issue discusses some methods for doing that, but do these scale, say, linearly in the output dimension M? That is, if you formulate this as the stacking of the gradients of M separate scalar-valued functions, does it in fact just take M times longer to compute? How does this change if you use the experimental tf.python.ops.parallel_for.gradients.jacobian? I believe it does raise warnings about space usage, and, in some circumstances, the "Jacobian" returned can have two batch dimensions i and j, and an enormous number of zero entries for all the cases where i != j.
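As a baseline for that timing question, by "stacking" I mean something like the following (a sketch in eager 2.x style; `f` and `m` are stand-ins):

```python
import tensorflow as tf

def jacobian_by_stacking(f, x, m):
    # One backward pass per output component: since f is re-evaluated each
    # time here, the cost grows (at least) linearly in the output dimension m.
    rows = []
    for i in range(m):
        with tf.GradientTape() as tape:
            tape.watch(x)
            y_i = f(x)[i]            # i-th scalar component of the output
        rows.append(tape.gradient(y_i, x))
    return tf.stack(rows)            # shape (m,) + x.shape
```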

As for clarification of the tf.custom_gradient documentation, I understand that, in deriving backpropagation via the chain rule, several vector-Jacobian products emerge that can be computed more efficiently as such, rather than as explicit full Jacobian evaluations followed by matrix-vector products. However, it is not totally clear to me how this squares with the terminology of "initial value gradients," "received gradients," or sometimes "upstream gradients" that is used to refer to the argument of the wrapped function. A more mathematical treatment would be very welcome, showing the complete chain rule for a simple expression, with proper equivalences drawn between the mathematical expressions and the code objects/arguments.
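Concretely, the equivalence I would like the docs to draw is (my notation): for $L = g(y)$ with $y = f(x)$, the chain rule gives $\partial L/\partial x = (\partial L/\partial y)\,(\partial y/\partial x)$; the dy argument handed to the function returned by custom_gradient plays the role of the upstream gradient $\partial L/\partial y$, and that function should return the VJP $(\partial L/\partial y)\,(\partial y/\partial x)$, never the Jacobian $\partial y/\partial x$ itself.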

I also see now that this is tagged as API v 2, though I'm still using the 1.x versions. I think that tf.gradients and tf.custom_gradient at least have similar documentation between the two versions, but in v2, there might be additional complications surrounding the "gradient tape" paradigm.

Sorry for the long-winded response. TensorFlow has been in a confusing state of flux over the past few versions, and so I (obviously) have many points of confusion on this.

@tanmoyio commented Nov 27, 2019

I think this issue is still open. If it is, I would like to work on it and improve the documentation.

@Saumil-Agarwal commented Jan 3, 2020

@tsbertalan @dynamicwebpaige @alextp Since the issue is still open, can I work on it and make the required changes to the documentation?

@alextp (Contributor) commented Jan 6, 2020

@ziofil commented Mar 9, 2020

bump

@hfahrudin commented Aug 7, 2020

Does this issue still occur?

@Harsh188 (Contributor) commented Aug 14, 2020

Hello, it's been quite some time since this issue was opened, and it looks like there aren't any active linked PRs. If this issue is still open, I will begin to work on it.
I'm not too familiar with the custom_gradient function. However, judging by the comments made by @tsbertalan and @alextp, the following changes have to be made to the documentation:

  • The document should clarify that under certain "cases an automatic sum will be added silently".
  • Clearing up the differences between "a batch index and a true index"

Feel free to correct me if I'm wrong, as my knowledge of gradients is fairly limited. Since this is also my first time working on an open-source issue, I would greatly appreciate it if anyone could point me towards the guidelines for PRs and documentation changes.

@AmitSharma1127 commented Oct 16, 2020

@Harsh188 I am open to this or any other work that needs improvement. I am just beginning my journey as an open-source contributor, and I would appreciate any pointers or useful tips to get started.

Please connect with me if you find a suitable time.

@adhilcodes commented Oct 23, 2020

I am a beginner and I would like to improve it; please help me to fix it.

@saad-sifar commented Oct 26, 2020

Same here; I am also a beginner and would like to contribute. Are there any places where I can help?

@suryagowda commented Oct 30, 2020

Hey there,
I am new to the open-source environment and I would like to contribute. Can anybody explain to me how to get started?

@ziofil commented Oct 30, 2020

I'm not sure if this is what you're looking for, but the first thing is to clone the repository (git clone [url of repo]). Then you make a new branch in which you develop a new feature or fix a bug. Finally, when the work is done, you open a pull request here on GitHub to have the new code reviewed and eventually merged.
