@bgarcial wrote:
I have the following dataset represented as a numpy array:

```python
direccion_viento_pos
Out[32]:
array([['S'],
       ['S'],
       ['S'],
       ...,
       ['SO'],
       ['NO'],
       ['SO']], dtype=object)
```

The dimension of this array is:

```python
direccion_viento_pos.shape
(17249, 8)
```
I am using Python and scikit-learn to encode these categorical variables as follows:

```python
from __future__ import unicode_literals
import pandas as pd
import numpy as np
# from sklearn import preprocessing
# from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
```
Then I create a label encoder object:

```python
labelencoder_direccion_viento_pos = LabelEncoder()
```
I take column position 0 (the only column) of `direccion_viento_pos` and apply the `fit_transform()` method to all of its rows:

```python
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])
```
My `direccion_viento_pos` now looks like this:

```python
direccion_viento_pos[:, 0]
array([5, 5, 5, ..., 7, 3, 7], dtype=object)
```
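For reference, here is a minimal, self-contained sketch of what this step does, on a made-up sample with the same kind of values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up mini-sample of wind-direction labels
sample = np.array([['S'], ['SO'], ['NO'], ['S']], dtype=object)

le = LabelEncoder()
# fit_transform sorts the unique labels and maps each one to an integer code
codes = le.fit_transform(sample[:, 0])

print(le.classes_)  # ['NO' 'S' 'SO']  -> codes 0, 1, 2
print(codes)        # [1 2 0 1]
```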
At this point, each row/observation of `direccion_viento_pos` has a numeric value, but I want to avoid the weighting problem this introduces: some rows now carry a higher value than others, even though the categories have no such order. For this reason, I create dummy variables, which according to this reference are:
> A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or more distinct categories/levels
Then, in my `direccion_viento_pos` context, I have 8 values:

- SO - Sur oeste (southwest)
- SE - Sur este (southeast)
- S - Sur (south)
- N - Norte (north)
- NO - Nor oeste (northwest)
- NE - Nor este (northeast)
- O - Oeste (west)
- E - Este (east)

That is, 8 categories.
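As an aside, the same indicator columns can also be visualized with pandas' `get_dummies`; this is a minimal sketch on a made-up sample (the `dir` prefix is only for readability):

```python
import pandas as pd

# Made-up sample of wind-direction labels
sample = pd.Series(['S', 'SO', 'NO', 'E', 'S'])

# One 0/1 indicator column per category, in alphabetical order
dummies = pd.get_dummies(sample, prefix='dir', dtype=int)
print(dummies)
#    dir_E  dir_NO  dir_S  dir_SO
# 0      0       0      1       0
# 1      0       0      0       1
# 2      0       1      0       0
# 3      1       0      0       0
# 4      0       0      1       0
```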
Next, I create a `OneHotEncoder` object with the `categorical_features` attribute, which specifies which features will be treated as categorical variables:

```python
onehotencoder = OneHotEncoder(categorical_features=[0])
```
And apply this `onehotencoder` to our `direccion_viento_pos` matrix:

```python
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()
```
My `direccion_viento_pos`, with its categorical variables encoded, now looks like this:

```python
direccion_viento_pos
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
```
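To sanity-check which one-hot column corresponds to which category, the position of the 1 in each row can be mapped back through `classes_`. A minimal self-contained sketch (the one-hot matrix here is rebuilt by hand with `np.eye` instead of `OneHotEncoder`, and the sample is made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up mini-example, rebuilt end to end
labels = np.array(['S', 'SO', 'NO', 'S'], dtype=object)
le = LabelEncoder()
codes = le.fit_transform(labels)  # [1 2 0 1]

# Column j of the one-hot matrix is 1 exactly where code == j
onehot = np.eye(len(le.classes_))[codes]

# The position of the 1 in each row is the integer code,
# so indexing classes_ recovers the original category
print(le.classes_[np.argmax(onehot, axis=1)])  # ['S' 'SO' 'NO' 'S']
```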
So far, then, I have created dummy variables for each category. I wanted to walk through this process in order to arrive at my question.

If these dummy-encoded variables are already in the 0-1 range, is it necessary to apply `MinMaxScaler` feature scaling to them?

Some say that it is not necessary to scale these dummy variables. Others say that it is necessary, because we want accurate predictions.
I ask this because, when I apply `MinMaxScaler` with `feature_range=(0, 1)`, my values change in some positions … despite still staying within this scale.

Which is the best option for me to choose with respect to my dataset `direccion_viento_pos`?
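To make the question concrete: as far as I understand, for a 0/1 column whose minimum is 0 and maximum is 1, `MinMaxScaler` with `feature_range=(0, 1)` computes `(x - 0) / (1 - 0) = x`, so such dummy columns should pass through unchanged. A quick sketch of that check on made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up dummy matrix: every column contains both 0s and 1s
dummies = np.array([[1., 0., 0.],
                    [0., 1., 0.],
                    [0., 0., 1.],
                    [1., 0., 0.]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(dummies)

# For columns with min 0 and max 1, the scaling is the identity
print(np.array_equal(dummies, scaled))  # True
```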