
I'm studying machine learning with sklearn and trying to understand ROC curves for binary classification. I wrote a simple classifier that fits a one-dimensional PCA and predicts a point as an outlier if ...

I am trying to make my HOG translation faster, but it seems that the more threads I add, the slower it gets. Any help?
def worker(patches,segment_start,segment_end,hog_array):
    for seg in range(...

I'm generating random samples for a binary classification problem:
X, y = make_classification(n_features=40, n_redundant=4, n_informative=36, n_clusters_per_class=2, n_samples=50000)
I want to check ...

It's the first time I'm using the scikit-learn library. I have this dataset, which is already "cleaned" and is called dedups:
print(dedups)
name price vehicleType \
1 ...

So I just came across this error:
Exception has occurred: MemoryError
Unable to allocate 48.0 MiB for an array with shape (64, 256, 256, 3) and data type float32
File "D:\Uni\MSc\ML\...

I have a marketing data set that I am using to predict returns based on spends. I am currently using a linear regression model. The issue is that the line only fits the data to a certain point, then ...

I would like to use CatBoost to perform RFECV:
Sample code here:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ...

I have a dataset of 4 columns: "id", "keywords", "age", and "sex". The aim of my project is to determine a person's age and sex based on some text of theirs. My keywords are in the following format for ...

I want to find out the outliers from the dataset. The dataset contains more than 35 columns with multivariate data.
So I have used sklearn library with kmeans and PCA to predict the outliers. I can ...

I have a problem that I've been asked to tackle. This is just a sample with 6 periods, but I will be handling larger samples than this. I need to predict for period 7 what would be the demand outcome ...

I'm a newbie in the data science realm, and in order to organize my code I'm using a pipeline.
The snippet of the code I'm trying to organize follows:
### Preprocessing ###
# Preprocessing for ...

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model ...

I am using a scikit-learn pipeline with XGBRegressor.
The pipeline works without any errors. When I predict with this pipeline, I predict the same data multiple times. Sometimes out ...

For handling the clustering model, which library is preferred? scipy.cluster or
sklearn.cluster

I followed this question to calculate feature importance on a decision tree:
scikit learn - feature importance calculation in decision trees
However, I can't get the correct value when calculating feature ...

I am using ConfusionMatrixDisplay from sklearn library to plot a confusion matrix on two lists I have and while the results are all correct, there is a detail that bothers me. The color's density in ...

I'm new to machine learning and I'm trying to predict the topic of an article given labeled datasets, each of which contains all the words in one article. There are 11 different topics in total and each ...

Can someone please direct me to a tutorial or provide a starting idea for the problem given below?
I have a mapping of authors to co-authors given as follows:
mapping
>>
{0: [2860, 3117],
1: [...

This is my code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from sklearn import preprocessing
df_train = pd.read_csv('Iris_train.csv')
...

I am using the Titanic dataset. I have done one-hot encoding on 3 categories: survived, sex, cabin.
encoder = OneHotEncoder(categories='auto',
drop='first',
sparse=...

After converting our categorical variables to dummy variables for training the model, we usually want to find the feature importance. But sklearn's model.feature_importances_ attribute returns feature ...

When I try to write "from sklearn.preprocessing import Imputer", I get "Cannot find reference Imputer in __init__.py". I installed the sklearn library and the pip version is 19.2.3. Can anyone ...
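For context, `Imputer` was removed from `sklearn.preprocessing` in scikit-learn 0.22; its replacement, `SimpleImputer`, lives in `sklearn.impute`. A minimal sketch of the newer import, assuming a recent scikit-learn version and a made-up toy array (the data and variable names are illustrative only):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
X = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
```

Here each missing entry is filled with its column mean; other strategies such as "median" or "most_frequent" are selected through the same `strategy` parameter.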

So I am pretty new to Python in general and I am trying to follow a tutorial to normalize and scale all of my data; however, I keep getting an error. I am using scikit-learn with pandas. I've searched ...

Is there a way to extract or compute the feature names and level names for a design matrix in scikit-learn? Here's an example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import ...

This is somewhat of a follow-up to my previous question about evaluating my scikit Gaussian process regressor. I am very new to GPRs and I think that I may be making a methodological mistake in how I ...

Before asking this question, I assure you I spent 2 days researching this topic on the Internet. As I failed to find a concrete answer, I am taking this question here.
I am new to data science, and I ...

I'm trying to train a model; however, when I fit the model, I get the following error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 3608]
Here is my code:
data ...
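The truncated code does not show the cause, but one common way to end up with `[1, 3608]` is passing a feature array of shape `(1, 3608)` alongside 3608 labels, since scikit-learn treats the first axis as the sample axis. A sketch of that situation, using hypothetical arrays rather than the asker's data:

```python
import numpy as np

# Hypothetical labels: 3608 samples.
y = np.zeros(3608)

# A single feature stored as one row: scikit-learn would read this
# as 1 sample with 3608 features.
X_bad = np.arange(3608, dtype=float).reshape(1, -1)

# Reshape so that rows are samples and the single column is the feature.
X_ok = X_bad.reshape(-1, 1)

print(X_bad.shape, X_ok.shape, y.shape)
```

If this is the cause, the first dimension of `X` must match `len(y)` before calling `fit`.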

I'm new to Gaussian processes and struggling to validate the output of my scikit GPR.
I'm particularly concerned with the fact that my GPR returns a score of 1, which doesn't make any sense to me ...

As you can read from the title, self from a class instance is not the class instance itself.
This happens when I use a custom class with scikit-learn pipelines, but not when I use the same custom ...

I'm working on an ML classification problem in a Jupyter notebook. Consider the following code:
Code (cell 1)
# all imports go here
w.filterwarnings('ignore')
# define scoring method
scoring = '...

I am using a GradientBoostingRegressor to forecast the next 48 values in a time series. To do so, I wrap the GradientBoostingRegressor in a MultiOutputRegressor().
What I would like to do to help the ...

I am building a lasso model and using the coefficients as an indicator of variable importance. Since the scale of my boolean variables is different from the others (after standardization), many ...

For my project, I work with three-dimensional MRI data, where the fourth dimension represents different subjects (I use the package nilearn for this). I am using sklearn.decomposition.PCA to extract a ...

Here is a very small example using precision_recall_curve():
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
y_true = [0, 1]
y_predict_proba = [0.25,0.75]
precision, ...
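As a side note on what `precision_recall_curve` returns, here is a small sketch using the same two points. The variable name `y_scores` is an illustrative choice. The function returns three arrays, and `precision` and `recall` each have one more element than `thresholds`, because a final point with precision 1 and recall 0 is appended with no corresponding threshold.

```python
from sklearn.metrics import precision_recall_curve

y_true = [0, 1]
y_scores = [0.25, 0.75]  # predicted probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# The last (precision, recall) pair is the appended end point of the curve.
print(precision[-1], recall[-1])
print(len(precision), len(recall), len(thresholds))
```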

My input is of this type:
My desired output will be of the format:
How can I do this with pandas or scikit-learn?

I'm running linear regression and trying to calculate my r^2, but I'm getting a negative value. How is this possible? I thought it could only be between 0 and 1. I ran k-means with 4 clusters, then I made ...
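On the general point, R^2 as computed by `sklearn.metrics.r2_score` is 1 - SS_res / SS_tot, so it drops below zero whenever the predictions fit worse than simply predicting the mean of the targets. A minimal sketch with made-up numbers (not the asker's data):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and deliberately poor predictions.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 2.0, 1.0])

# Manual computation: R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)         # 8.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 2.0
r2_manual = 1.0 - ss_res / ss_tot               # -3.0

r2 = r2_score(y_true, y_pred)
print(r2_manual, r2)
```

The range 0 to 1 is only guaranteed for an ordinary least squares fit with an intercept, evaluated on the same data it was trained on.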

I am using a MOEA to solve DTLZ1 and DTLZ2. For DTLZ1, a fitness function based on the sum of the objectives works fine because the Pareto front is straight, but for DTLZ2, as it's convex, the solutions ...

When importing sklearn I get the following error:
python3: Relink `/home/xx/xx/xx/xx/xx/xx/lib/python3.6/sitepackages/sklearn/__check_build/../.libs/libgomp3300acd3.so.1.0.0' with `/lib/x86_64linux...

I'm running a grid search for model parameter tuning. It works when using n_jobs=1 without dask backend. But when I switch to dask, I'm getting the following error:
ValueError: X has 205995 ...

I have a set of boreholes with a different number of samples in each hole. I want to try training the model on a single borehole and testing on the ...

I am trying to learn SimpleImputer on the data set provided on the course tab on Kaggle: https://www.kaggle.com/alexisbcook/missingvalues
The CSV file is available at the link above.
While trying out the ...

For a side project of mine, I am trying to build a Naive Bayes model that can detect if a piece of news is fake based on the headline. Here is my code so far:
import numpy as np
import pandas as pd
...

Does performing grid search on hyperparameters guarantee improved performance when tested on the same data set?
I ask because my intuition was "yes", however I got slightly lower scores after tuning ...

I created a clustering model to try and find different groups of customers based on annual income and spending score using the KMeans algorithm from scikit-learn. Using the cluster value that it ...

I am trying to perform one-hot encoding on some categorical columns. According to the tutorial I am following, I am supposed to do label encoding before one-hot encoding. I have successfully performed the ...
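For what it's worth, since scikit-learn 0.20 `OneHotEncoder` accepts string categories directly, so a separate label-encoding step is not required for input features. A minimal sketch, assuming a hypothetical single categorical column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; each row is one sample.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# Categories are learned from the data and stored in sorted order.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X).toarray()  # densify the sparse result

print(encoder.categories_[0])
print(encoded)
```

`LabelEncoder` is intended for target labels rather than input features, which is why older tutorials that chain the two can be misleading with current versions.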

What features, advantages, and disadvantages, in terms of constructing separating hyperplanes, the quality of class separation, and computational efficiency, are characteristic of each of the following ...

I'm running a bunch of models with scikit-learn to solve a classification problem.
How do I iterate through different scikit-learn models?
from sklearn.ensemble import AdaBoostClassifier
from ...
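One common pattern is to keep the estimators in a dictionary and loop over it, since all scikit-learn classifiers share the same `fit`/`predict` interface. A sketch under stated assumptions: the synthetic data, the choice of models, and the dictionary keys are all illustrative and not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical selection of models, keyed by a readable name.
models = {
    "adaboost": AdaBoostClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Evaluate each model with the same 3-fold cross-validation.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```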

I have been reading the book Hands-On Machine Learning with Scikit-Learn and TensorFlow and I found this code:
from sklearn.model_selection import StratifiedShuffleSplit
split = ...

I am trying to print out all of the imputation values after fitting with SimpleImputer. When using SimpleImputer by itself, I can retrieve these from the instance's statistics_ attribute.
This works ...
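The rest of the question is cut off, so the exact setup is unknown. Assuming the imputer is a step inside a `Pipeline`, the fitted instance can be reached through `named_steps` and its `statistics_` read from there. The step names and data below are hypothetical:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value in each column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # step name is an assumption
    ("scaler", StandardScaler()),
])
pipe.fit(X)

# Fitted steps keep their learned attributes; look them up by step name.
stats = pipe.named_steps["imputer"].statistics_
print(stats)
```

For a `ColumnTransformer`, the analogous lookup on the fitted object goes through `named_transformers_`.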

I have created an estimator to clean a column and then used this estimator for column transformation. Here is my estimator
class CabinImputer(BaseEstimator, TransformerMixin):
    def __init__(self, ...