Muhammad Yasir
@SyedMuhamadYasir
so you can understand exactly what I am trying to do
I am trying to do a feature selection / feature reduction task
since the original dataset, without any preprocessing and with ALL features, gives near-perfect results, it would undermine the case for applying any kind of feature-reduction technique
I will try out what you said in your answer, but I thought it would be better to give you the entire context and the actual reason why we need 'bad' results before applying any feature reduction
Guillaume Lemaitre
@glemaitre
It might still allow you to fit and predict faster, and this is probably what feature selection is best at.
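For instance, a minimal sketch with SelectKBest in front of an SVC (the synthetic data and k=20 are illustrative, not a recommendation):

```python
# A minimal sketch: feature selection mainly buys faster fit/predict.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=500, random_state=0)

# Keep the 20 features with the highest ANOVA F-score; the SVC then
# trains and predicts on a 5000 x 20 matrix instead of 5000 x 500.
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
pipe.fit(X, y)
```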
Muhammad Yasir
@SyedMuhamadYasir

It might still allow you to fit and predict faster, and this is probably what feature selection is best at.

that... is actually a very good point, thanks!

@glemaitre Thank you very much :)
Mateusz Gałażyn
@carbolymer
hi, quick question: is there a preprocessor in scikit-learn that lets me split a feature into two?
Mateusz Gałażyn
@carbolymer
ok, there's ColumnTransformer, I can use that
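For example, one way to split a column into two output features (a sketch; the "date" column and the year/month split are illustrative):

```python
# A sketch of splitting one column into two output features with a
# FunctionTransformer inside a ColumnTransformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"date": ["2021-02", "2020-11"], "value": [1.0, 2.0]})

def split_year_month(X):
    # X is the DataFrame slice selected below; return two numeric columns.
    parts = X["date"].str.split("-", expand=True)
    return parts.astype(int).to_numpy()

ct = ColumnTransformer(
    [("split", FunctionTransformer(split_year_month), ["date"])],
    remainder="passthrough",  # keep the other columns unchanged
)
print(ct.fit_transform(df))  # -> columns: year, month, value
```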
lesshaste
@lesshaste
has anyone started a PR on Dirichlet calibration of probabilities?
I couldn't find it if they have
Adrin Jalali
@adrinjalali
That's a rather new paper, and I'm not sure it passes our inclusion criteria.
Amanda Dsouza
@amy12xx
I am getting some unrelated errors (Azure pipeline stack trace) for a PR I just pushed. Can anyone take a look and let me know how I can fix them? scikit-learn/scikit-learn#19387
lesshaste
@lesshaste
@adrinjalali yes. I was just wondering if anyone thought it was interesting.
rohanishervin
@rohanishervin
Why does ColumnTransformer convert the datatype to object after calling fit_transform?
benny
@benny:michael-enders.com [m]
I have a general question: if a k-neighbors classifier works well for my dataset (compared to e.g. SVC and Random Forest), are there other classifiers that might work equally well?
Guillaume Lemaitre
@glemaitre
@benny Stuff based on distances, then
benny
@benny:michael-enders.com [m]
can you give some examples?
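For example, a few distance-based estimators in scikit-learn worth trying (a sketch; the hyperparameters are illustrative and the results are entirely dataset-dependent):

```python
# A sketch of other distance-based classifiers to try when a k-neighbors
# model does well.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import (KNeighborsClassifier, NearestCentroid,
                               RadiusNeighborsClassifier)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # the RBF kernel is a function of pairwise distances

candidates = {
    "knn": KNeighborsClassifier(),
    "nearest_centroid": NearestCentroid(),
    "radius_neighbors": RadiusNeighborsClassifier(
        radius=1.0, outlier_label="most_frequent"),
    "rbf_svc": SVC(kernel="rbf"),
}

X, y = load_iris(return_X_y=True)
for name, clf in candidates.items():
    # Distance-based models are sensitive to feature scales, hence the scaler.
    score = cross_val_score(make_pipeline(StandardScaler(), clf), X, y).mean()
    print(f"{name}: {score:.3f}")
```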
Sharyar Memon
@sharyar
Hi everyone, I am not certain if this is the right place to ask. I am a first-time contributor. I love the library and it has helped me immensely in my studies so far. I was hoping to work on this issue as my first issue: scikit-learn/scikit-learn#18338
As far as I can understand, this issue requires that the documentation be updated; does that mean only the docstring within the function definition, or is it referring to another piece of documentation?
One of the commenters on the issue also mentions ensuring there are tests that break if this documentation doesn't exist; how do I go about doing that effectively?
Sharyar Memon
@sharyar

I have a general question: if a k-neighbors classifier works well for my dataset (compared to e.g. SVC and Random Forest), are there other classifiers that might work equally well?

I think it will depend on the data set. It also depends on how you are pre-processing your data. So kinda hard to say without knowing more.

lesshaste
@lesshaste
when I use OrdinalEncoder on my matrix X, how can I make the mapping the same for each column?
currently it is different if one column contains a value that doesn't occur in another column
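One way (a sketch, assuming the full set of values can be enumerated up front) is to pass the same categories list for every column:

```python
# A sketch: force the same value -> integer mapping in every column by
# passing one shared category list per column via the `categories` parameter.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["a", "b"],
              ["c", "a"]])

# Union of the values seen anywhere in X; each column gets the same list.
shared = sorted(np.unique(X))
enc = OrdinalEncoder(categories=[shared] * X.shape[1])
print(enc.fit_transform(X))  # "a" -> 0, "b" -> 1, "c" -> 2 in both columns
```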
William Gacquer
@citron
Hello happy scikit-learners!
I need some help, please.
I want to serve an ONNX model.
Input = 144 columns (medical records, some categorical, some not).
Output = classification.
Pipeline = StandardScaler + LabelEncoder + LightGBM.
I am stuck with the LabelEncoder. Is there an example of such a configuration somewhere? Google was not my friend.
I was able to produce an ONNX model when bypassing the LabelEncoder... but I need it, and I want to avoid one-hot encoding because LightGBM performs much better without it.
Anyone?
rthy
@rthy:matrix.org [m]
@citron You probably want OneHotEncoder, not LabelEncoder
Also, for tree-based models it's better to use OrdinalEncoder for categorical features
Nicolas Hug
@NicolasHug

Also, for tree-based models it's better to use OrdinalEncoder for categorical features

I'm not sure that's true: using OrdinalEncoder will make the trees treat categories as ordered values, but they're not. Native categorical support (as in LightGBM) properly treats categories as unordered and can yield the same splits with less tree depth

rthy
@rthy:matrix.org [m]
Yes, you are right. I guess I'm too used to scikit-learn tree-based models not having native categorical support :)
Olivier Grisel
@ogrisel
I agree with @NicolasHug in theory, but in practice the difference with OrdinalEncoder (with tuned hyperparams) is typically negligible ;)
@citron Using OrdinalEncoder is probably the pragmatic solution. OneHotEncoder is only efficient if you use sparse outputs, which are currently not supported by ONNX as far as I know.
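Put together, that pragmatic setup could look roughly like this (a sketch; the column names are hypothetical and lightgbm is assumed to be installed):

```python
# A sketch of the pragmatic setup: OrdinalEncoder on the categorical
# columns, numeric columns passed through untouched, LightGBM on top.
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical = ["sex", "pathology"]  # hypothetical column names
numerical = ["age", "weight"]       # hypothetical column names

preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(), categorical),
    ("num", "passthrough", numerical),
])
pipe = make_pipeline(preprocess, LGBMClassifier())
# pipe.fit(X_train, y_train); pipe.predict(X_test)
```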
Xavier Dupré
@xadupre
@citron: what's the issue with LabelEncoder and ONNX? (I'm the main author of sklearn-onnx).
Olivier Grisel
@ogrisel

@citron also you said "Pipeline = StandardScaler + LabelEncoder + LightGBM", but I assume you use a ColumnTransformer so that you scale only the numerical features and encode the categorical features separately: https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data

BTW, standard-scaling the numerical features is often useless for tree-based models in general, and even more so for implementations such as LightGBM that bin the features.

William Gacquer
@citron
Hello @xadupre, @ogrisel, @rthy:matrix.org, @NicolasHug. Yes, I do use a ColumnTransformer. Maybe I should express my needs better. The training set is made of 300000 rows. Column types are either floating point, integer (and sadly pandas does not provide R's data-frame handling of NA), boolean, category, or list of categories. For instance, some category columns may have 2 or 10 numerical categories, some only have "string" categories, some have a list of medications or a list of pathologies.
I have tried plenty of frameworks and among them LightGBM was the best. Now, as I need to export the model and the pipeline in ONNX/ONNX-ML format, I need to wrap LightGBM in something to keep the pipeline around.
Olivier Grisel
@ogrisel
pandas 1.0 and later has support for explicit missing values in integer columns: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
scikit-learn, however, will convert this to float anyway (but that's no big deal).
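For example (a sketch of the nullable dtype):

```python
# A sketch of pandas' nullable integer dtype: the column stays integer
# while holding an explicit missing value (note capital-I "Int64").
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)   # Int64
print(s.isna())  # [False, False, True]
```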
William Gacquer
@citron
@ogrisel Yes, no problem with pure int columns.
Olivier Grisel
@ogrisel

For the categorical columns, try to use OrdinalEncoder. In 0.24+ we have better support for unknown categories at test time:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

Although I am not sure that sklearn-onnx has replicated that feature yet.

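Concretely, the 0.24+ behaviour looks roughly like this (a sketch):

```python
# A sketch of the 0.24+ behaviour: categories unseen during fit are
# mapped to a dedicated integer instead of raising an error.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(np.array([["cat"], ["dog"]]))
print(enc.transform(np.array([["fish"], ["cat"]])))  # [[-1.], [0.]]
```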
William Gacquer
@citron
@ogrisel maybe @xadupre knows ?
Olivier Grisel
@ogrisel
If you have specific problems exporting a pipeline with OrdinalEncoder to ONNX, it's better to report the exact error message with a simple reproduction case to https://github.com/onnx/sklearn-onnx
Xavier Dupré
@xadupre
I wrote this example about converting a pipeline including a lightgbm model in a scikit-learn pipeline: http://onnx.ai/sklearn-onnx/auto_tutorial/plot_gexternal_lightgbm.html.
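The gist of that example is roughly the following (a sketch following the linked tutorial; the synthetic data and the 144-feature input shape are illustrative):

```python
# A sketch: register the onnxmltools LightGBM converter with skl2onnx,
# then convert the whole scikit-learn pipeline to ONNX.
import numpy as np
from lightgbm import LGBMClassifier
from onnxmltools.convert.lightgbm.operator_converters.LightGbm import (
    convert_lightgbm,
)
from skl2onnx import convert_sklearn, update_registered_converter
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx.common.shape_calculator import (
    calculate_linear_classifier_output_shapes,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tell skl2onnx how to convert the LightGBM estimator inside the pipeline.
update_registered_converter(
    LGBMClassifier, "LightGbmLGBMClassifier",
    calculate_linear_classifier_output_shapes, convert_lightgbm,
    options={"nocl": [True, False], "zipmap": [True, False]},
)

X = np.random.randn(100, 144).astype(np.float32)
y = (X[:, 0] > 0).astype(int)
pipe = make_pipeline(StandardScaler(), LGBMClassifier(n_estimators=3)).fit(X, y)

onnx_model = convert_sklearn(
    pipe, "pipeline_lightgbm",
    [("input", FloatTensorType([None, 144]))],
)
```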
William Gacquer
@citron
@xadupre Thanks! The binder link at the end of the page has a problem.
In fact, that's the example I started with. It works fine without the LabelEncoder.
William Gacquer
@citron
I forgot to mention an important thing: I use FLAML to select the best hyperparameters and thus the best model.
Xavier Dupré
@xadupre
I'll investigate the issue with LabelEncoder then. What is the error you get?
Loïc Estève
@lesteve
I think it would be a good idea to encourage creating a GitHub Discussion (rather than gitter) for anything other than simple questions/answers: https://github.com/scikit-learn/scikit-learn/discussions/new. gitter is not properly indexed by search engines, so it is not a great use of the time of the people who answer questions.
I agree that "simple question/answer" does not have a very well-defined boundary, but in the case of @citron's questions I think we crossed that boundary a long time ago ...
William Gacquer
@citron
@lesteve I understand and agree.
Loïc Estève
@lesteve
@citron then if you find the time, maybe create a GitHub Discussion and post the link on gitter so that the conversation can continue in the GitHub Discussion?
SmellySalami
@SmellySalami
Hello guys! My friends and I are looking to tackle some open issues on scikit-learn soon. We're very new, so I would love a high-level overview of the architecture.
Can anyone help or point to some resources?
rthy
@rthy:matrix.org [m]
Have a look at https://scikit-learn.org/stable/developers/contributing.html for a getting-started guide.