Quickstart
############

A quick example of how to perform feature reduction using ``pca``.

.. code:: python

    import numpy as np
    from sklearn.datasets import load_iris
    import pandas as pd

    # Load pca
    from pca import pca

    # Load dataset
    label = load_iris().feature_names
    y = load_iris().target
    X = pd.DataFrame(data=load_iris().data, columns=label, index=y)

    # Initialize to reduce the data up to the number of components that explains 95% of the variance.
    model = pca(n_components=0.95)

    # Reduce the data towards 3 PCs
    model = pca(n_components=3)

    # Fit transform
    results = model.fit_transform(X)

    # Data looks like this:
    # X=array([[5.1, 3.5, 1.4, 0.2],
    #          [4.9, 3. , 1.4, 0.2],
    #          [4.7, 3.2, 1.3, 0.2],
    #          [4.6, 3.1, 1.5, 0.2],
    #          ...
    #          [5. , 3.6, 1.4, 0.2],
    #          [5.4, 3.9, 1.7, 0.4],
    #          [4.6, 3.4, 1.4, 0.3],
    #          [5. , 3.4, 1.5, 0.2]])
    #
    # y = [0, 0, 0, 0,...,2, 2, 2, 2, 2]
    #
    # label = ['sepal length (cm)',
    #          'sepal width (cm)',
    #          'petal length (cm)',
    #          'petal width (cm)']


Compute explained variance
************************************

After ``fit_transform``, the cumulative explained variance is stored together with the explained variance per PC.

.. code:: python

    # Cumulative explained variance
    print(model.results['explained_var'])
    # [0.92461872 0.97768521 0.99478782]

    # Explained variance per PC
    print(model.results['variance_ratio'])
    # [0.92461872, 0.05306648, 0.01710261]

    # Make plot
    fig, ax = model.plot()


.. image:: ../figs/fig_plot.png
    :width: 600
    :align: center


PCs that cover 95% of the explained variance
************************************************************************

The number of PCs can be reduced by setting the ``n_components`` parameter. Note that the number of components can never exceed the number of variables in your dataset. Setting ``n_components`` to a value of **1 or larger** performs a feature reduction to exactly that number of components. Setting ``n_components`` to a value **smaller than 1** specifies the minimum percentage of explained variance that must be covered. In other words, with ``n_components=0.95``, the number of components is extracted that covers at least 95% of the explained variance.

.. code:: python

    # Reduce the data towards 3 PCs
    model = pca(n_components=3)

    # Extract the number of components that covers at least 95% of the explained variance.
    model = pca(n_components=0.95)


Scatter plot
******************

.. code:: python

    # 2D plot
    fig, ax = model.scatter()

    # 3D plot
    fig, ax = model.scatter3d()


.. |figE1| image:: ../figs/fig_scatter.png
.. |figE2| image:: ../figs/fig_scatter3d.png

.. table:: Scatter plot in 2D (left) and 3D (right)
    :align: center

    +----------+----------+
    | |figE1|  | |figE2|  |
    +----------+----------+


Biplot
******************

.. code:: python

    # 2D plot
    fig, ax = model.biplot(n_feat=4, PC=[0,1])

    # 3D plot
    fig, ax = model.biplot3d(n_feat=2, PC=[0,1,2])


.. image:: ../figs/fig_biplot.png
    :width: 600
    :align: center


Demonstration of feature importance
#####################################################

This example showcases how to extract the features that are most important in a PCA reduction. We create random variables with decreasing variance: the first feature (f1) has the most variance, followed by feature 2 (f2), and so on. The top features can afterwards be printed with ``model.results['topfeat']``.

.. code:: python

    # Import libraries
    import numpy as np
    import pandas as pd
    from pca import pca

    # Let's create a dataset with features that have decreasing variance.
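    # Fix the random seed so the example below is reproducible.
    # Note: this seed is an addition for illustration and not part of the original example.
    np.random.seed(42)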
    # We want to extract feature f1 as the most important, followed by f2, etc.
    f1=np.random.randint(0,100,250)
    f2=np.random.randint(0,50,250)
    f3=np.random.randint(0,25,250)
    f4=np.random.randint(0,10,250)
    f5=np.random.randint(0,5,250)
    f6=np.random.randint(0,4,250)
    f7=np.random.randint(0,3,250)
    f8=np.random.randint(0,2,250)
    f9=np.random.randint(0,1,250)

    # Combine into dataframe
    X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
    X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])

    # Initialize and keep all PCs
    model = pca()

    # Fit transform
    out = model.fit_transform(X)

    # Print the top features.
    print(out['topfeat'])

    # The results show the expected ranking: f1 is the most important, followed by f2, etc.
    #     PC  feature
    # 0  PC1       f1
    # 1  PC2       f2
    # 2  PC3       f3
    # 3  PC4       f4
    # 4  PC5       f5
    # 5  PC6       f6
    # 6  PC7       f7
    # 7  PC8       f8
    # 8  PC9       f9


Explained variance plot
****************************

.. code:: python

    model.plot()


.. image:: ../figs/explained_var_1.png
    :width: 600
    :align: center


Biplot
****************************

Make the biplot. Note that the feature with the most variance (f1) lies almost horizontally in the plot, whereas the feature with the second most variance (f2) lies almost vertically. This is expected because most of the variance is in f1, followed by f2, etc. In the 3D biplot, f3 shows up, as expected, along the z-direction.

.. code:: python

    # 2D plot
    ax = model.biplot(n_feat=10, legend=False)

    # 3D plot
    ax = model.biplot3d(n_feat=10, legend=False)


.. |figA1| image:: ../figs/biplot2d.png
.. |figA2| image:: ../figs/biplot3d.png

.. table:: Biplot in 2D (left) and 3D (right)
    :align: center

    +----------+----------+
    | |figA1|  | |figA2|  |
    +----------+----------+


Analyzing Discrete datasets
#####################################################

Analyzing datasets that contain both continuous and categorical values can be challenging. To demonstrate how to do this, I will use the Titanic dataset. We first need to pip install ``df2onehot``.

.. code:: bash

    pip install df2onehot

.. code:: python

    import pca

    # Import example
    df = pca.import_example()

    # Transform data into one-hot
    from df2onehot import df2onehot
    y = df['Survived'].values
    del df['Survived']
    del df['PassengerId']
    del df['Name']
    out = df2onehot(df)
    X = out['onehot'].copy()
    X.index = y

    from pca import pca

    # Initialize
    model1 = pca(normalize=False, onehot=False)
    # Run model 1
    model1.fit_transform(X)

    # Number of unique features among the top features:
    # len(np.unique(model1.results['topfeat'].iloc[:,1]))
    model1.results['topfeat']
    model1.results['outliers']

    model1.plot()
    model1.biplot(n_feat=10)
    model1.biplot3d(n_feat=10)
    model1.scatter()
    model1.scatter3d()

    # Initialize
    model2 = pca(normalize=True, onehot=False)
    # Run model 2
    model2.fit_transform(X)
    model2.plot()
    model2.biplot(n_feat=4)
    model2.scatter()
    model2.biplot3d(n_feat=10)

    # Set custom transparency levels
    model2.biplot3d(n_feat=10, alpha=0.5)
    model2.biplot(n_feat=10, alpha=0.5)
    model2.scatter3d(alpha=0.5)
    model2.scatter(alpha=0.5)

    # Initialize
    model3 = pca(normalize=False, onehot=True)
    # Run model 3
    _ = model3.fit_transform(X)
    model3.biplot(n_feat=3)


Map unseen datapoints into fitted space
##############################################

After fitting variables into the new principal component space, we can also map new unseen samples into that space. However, there is a normalization step that can be tricky: the values of the unseen samples first need to be standardized using the standardization that was previously fitted. This step is integrated in the ``pca`` library by simply setting the parameter ``normalize=True``.
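Before walking through the full example below, the following minimal sketch illustrates what this standardization step means. It uses scikit-learn's ``StandardScaler`` directly (this is not part of the ``pca`` API, and the data here is random); the point is that unseen samples must be scaled with the mean and standard deviation fitted on the original data, not with their own statistics.

.. code:: python

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on the original (training) data only.
    X_train = np.random.rand(100, 4)
    scaler = StandardScaler().fit(X_train)

    # Unseen samples are transformed with the previously fitted mean and standard
    # deviation. Conceptually, this is the step that pca handles for you when
    # normalize=True is set at initialization.
    X_unseen = np.random.rand(5, 4)
    X_unseen_scaled = scaler.transform(X_unseen)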
.. code:: python

    # Load libraries
    import matplotlib.pyplot as plt
    from sklearn import datasets
    import pandas as pd
    from pca import pca

    # Load dataset
    data = datasets.load_wine()
    X = data.data
    y = data.target.astype(str)
    col_labels = data.feature_names

    # Initialize with normalization and take the number of components that covers at least 95% of the variance.
    model = pca(n_components=0.95, normalize=True)

    # Get some samples across the classes
    idx=[0,1,2,3,4,50,53,54,55,100,103,104,105,130,150]
    X_unseen = X[idx, :]
    y_unseen = y[idx]

    # Relabel these samples in the original dataset so that we can check which samples overlap
    y[idx]='unseen'

    # Fit transform
    model.fit_transform(X, col_labels=col_labels, row_labels=y)

    # Transform the new "unseen" data. Note that these datapoints are not really unseen, as they were already fitted above.
    # But for the sake of the example, you can see that these samples are transformed exactly on top of the original ones.
    PCnew = model.transform(X_unseen)

    # Plot PC space
    fig, ax = model.scatter(title='Map unseen samples in the existing space.')
    # Plot the new "unseen" samples on top of the existing space
    ax.scatter(PCnew.iloc[:, 0], PCnew.iloc[:, 1], marker='x', s=200)


.. image:: ../figs/wine_mapping_samples.png
    :width: 600
    :align: center


Normalizing out PCs
#########################

Normalize your data using the principal components. As an example, suppose there is (technical) variation in the first component and you want it removed. This function transforms the data using only the components that you want to keep, e.g., starting from the 2nd PC up to the PC that contains at least 95% of the explained variance.

.. code:: python

    print(X.shape)
    # (178, 13)

    # Normalize out the 1st component and return the data
    Xnorm = model.norm(X, pcexclude=[1])

    # The data keeps the same samples and variables, but all the variance that was covered by the 1st PC is removed.
    print(Xnorm.shape)
    # (178, 13)

    # In this case, PC1 is "removed" and PC2 has become PC1, etc.
    ax = pca.biplot(model, col_labels=col_labels, row_labels=y)


Colors in plots
#########################

The default colors used in the plots depend on how much information is provided at initialization. There are many parameters to change the colors in the plots. Here I will demonstrate some of the possibilities. First, we load the data and import the libraries.

.. code:: python

    # Import iris dataset and other required libraries
    from sklearn.datasets import load_iris
    import pandas as pd
    import matplotlib as mpl
    import colourmap

    # Import pca
    from pca import pca

    # Class labels
    y = load_iris().target

    # Initialize pca
    model = pca(n_components=3, normalize=True)

    # Dataset
    X = pd.DataFrame(index=y, data=load_iris().data, columns=load_iris().feature_names)

    # Fit transform
    out = model.fit_transform(X)

Let's start with the default plot using the class labels (y), and then change it using a custom cmap.

.. code:: python

    # The default setting is to color on the class labels (y). These are provided as the index in the dataframe.
    model.biplot()

    # Use a custom cmap for the class labels (as an example, I explicitly provide three colors).
    model.biplot(cmap=mpl.colors.ListedColormap(['green', 'red', 'blue']))


.. |figE3| image:: ../figs/color_default.png
.. |figE4| image:: ../figs/color_cmap.png

.. table:: Left: Default plot using the provided class labels. Right: Color on custom cmap.
    :align: center

    +----------+----------+
    | |figE3|  | |figE4|  |
    +----------+----------+
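If the data has more classes than the three colors listed above, a ``ListedColormap`` with one color per class can be built programmatically. Below is a minimal sketch using plain matplotlib that reuses ``model`` and ``y`` from the block above; the hex colors are arbitrary choices for illustration and not defaults of ``pca``.

.. code:: python

    import numpy as np
    import matplotlib as mpl

    # One color per class; the palette should contain at least as many colors as there are unique class labels.
    palette = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
    n_classes = len(np.unique(y))
    cmap = mpl.colors.ListedColormap(palette[:n_classes])

    # Pass the custom colormap to the biplot, exactly as in the example above.
    model.biplot(cmap=cmap)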
If you want to highlight some samples in the graph, you can easily change the class labels. The colors are automatically created using the specified colormap. However, the points of interest may still be difficult to find. Therefore, it is also possible to set the input colors for each sample manually.

.. code:: python

    # Set custom class labels. Coloring is based on the input colormap (cmap).
    y[10:15]=4
    model.biplot(labels=y, cmap='Set2')

    # Set custom class labels and also use custom colors.
    c = colourmap.fromlist(y, cmap='Set2')[0]
    c[10:15] = [0,0,0]
    model.biplot(labels=y, c=c)


.. |figE5| image:: ../figs/color_cmap_y.png
.. |figE6| image:: ../figs/color_using_custom_colors.png

.. table:: Left: Mark some points on y and use cmap. Right: Specify the colors manually.
    :align: center

    +----------+----------+
    | |figE5|  | |figE6|  |
    +----------+----------+


To highlight the loadings, all scatter points can be removed by setting the cmap to None.

.. code:: python

    # Remove scatter points by setting cmap=None
    model.biplot(cmap=None)

    # Gradient with white ending using the cmap setting.
    model.biplot(labels=y, gradient='#ffffff', cmap=mpl.colors.ListedColormap(['green', 'red', 'blue']))


.. |figE7| image:: ../figs/color_no_scatter.png
.. |figE8| image:: ../figs/color_gradient.png

.. table:: Left: Remove scatter points from the plot. Right: Gradient with the used cmap.
    :align: center

    +----------+----------+
    | |figE7|  | |figE8|  |
    +----------+----------+


It is also possible to pass a fig as a parameter to the plot. This allows making iterative changes.

.. code:: python

    from sklearn.datasets import make_friedman1
    X, _ = make_friedman1(n_samples=200, n_features=30, random_state=0)

    # Init
    model = pca()
    # Fit
    model.fit_transform(X)

    # Make a plot with blue arrows and text
    fig, ax = model.biplot(c=[0,0,0], s=25, arrowdict={'fontsize':10, 'weight':'normal'}, color_arrow='blue', title=None, HT2=True, n_feat=10, visible=True)

    # Use the existing fig and make new edits, such as red arrows for the first three loadings. Also change the font sizes.
    fig, ax = model.biplot(c=[0,0,0], s=25, arrowdict={'fontsize':16, 'weight':'bold'}, color_arrow='red', n_feat=3, title='updated fig.', visible=True, fig=fig)


.. |figE9| image:: ../figs/fig_iterative_changes.png

.. table:: Fig as input to make iterative changes.
    :align: center

    +----------+
    | |figE9|  |
    +----------+



.. include:: add_bottom.add