1. What tricks does PCA do
In brief, PCA summarizes data by finding linear combinations of its features. You can think of it as taking several pictures of a 3D object: PCA naturally sorts the pictures from the most representative to the least before handing them to you.
With our original data as input, PCA produces 2 useful outputs: Z and W. Multiplying them gives us the reconstructed data, which is the original data with some tolerable information loss (since we have reduced the dimensionality).
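As a minimal sketch of this relationship (on synthetic data; every name and shape here is just illustrative, not part of the MRT analysis):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 samples, 5 features

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)               # the low-dimensional representation
W = pca.components_                # the principal axes (2 x 5)

# Multiplying Z and W (plus the mean) reconstructs X with some loss
X_hat = Z @ W + pca.mean_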
We will explain these 2 output matrices with our MRT data in the practice below.
2. What can we do after applying PCA
After applying PCA to reduce the dimensionality of our data, we can use the result for other machine learning tasks, such as clustering, classification, and regression.
In the Taipei MRT case later in this article, we will perform clustering on the lower-dimensional data, where a few dimensions can be interpreted as passenger proportions in different parts of the day, such as morning, noon, and evening. Stations that share similar proportions of passengers across the day would be considered to be in the same cluster (their patterns are alike!).
3. Take a look at our traffic dataset!
The dataset we use here is the Taipei Metro Rapid Transit System, Hourly Traffic Data, with columns: date, hour, origin, destination, and passenger_count.
In our case, I will keep weekday data only, since the patterns between stations are more interesting on weekdays: stations in residential areas may have more commuters entering in the daytime, while in the evening, stations in business areas may have more people getting in.
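For reference, a hedged sketch of this weekday filter (the file name is hypothetical; the column names come from the dataset description above):

import pandas as pd

# Load the raw hourly traffic data (file name is an assumption)
df = pd.read_csv('taipei_mrt_hourly.csv')

# Keep Monday-Friday records only
df['date'] = pd.to_datetime(df['date'])
df = df[df['date'].dt.weekday < 5]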
The plot above shows the hourly traffic trends (the number of passengers entering the station) of 4 different stations. The 2 lines in red are Xinpu and Yongan Market, which are located in the densely populated areas of New Taipei City. On the other hand, the 2 lines in blue are Taipei City Hall and Zhongxiao Fuxing, where most of the companies are located and business activities happen.
The trends reflect both the nature of these areas and stations, and we can notice that the difference is most obvious when comparing their trends during commute hours (7 to 9 a.m., and 5 to 7 p.m.).
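The chart can be reproduced roughly as follows (a sketch that assumes the filtered df from above; the exact station name strings are assumptions):

import matplotlib.pyplot as plt

stations = ['Xinpu', 'Yongan_Market', 'Taipei_City_Hall', 'Zhongxiao_Fuxing']
colors = ['red', 'red', 'blue', 'blue']

for name, color in zip(stations, colors):
    # Entries per hour per day, then averaged across days
    trend = (df[df['origin'] == name]
             .groupby(['date', 'hour'])['passenger_count'].sum()
             .groupby('hour').mean())
    plt.plot(trend.index, trend.values, color=color, label=name)

plt.xlabel('Hour')
plt.ylabel('Passengers entering')
plt.legend()
plt.show()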
4. Using PCA on hourly traffic data
Why reduce dimensionality before conducting further machine learning tasks?
There are 2 main reasons:
- As the number of dimensions increases, the distances between any two data points become more and more alike, and thus less meaningful for telling points apart. This is what is often referred to as “the curse of dimensionality” (see the short demo after this list).
- The traffic data is high-dimensional by nature, which makes it difficult to visualize and interpret directly.
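To make the first point concrete, here is a tiny demo (not part of the original analysis): the relative contrast between the farthest and the nearest pair of random points shrinks as the dimensionality grows.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((200, d))
    dists = pdist(points)
    # Relative contrast: the max-min gap compared to the mean distance
    print(d, (dists.max() - dists.min()) / dists.mean())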
By applying PCA, we can identify the hours in which the traffic trends of different stations differ most clearly. Intuitively, from the plot shown previously, we can assume that the hours around 8 a.m. and 6 p.m. may be representative enough to cluster the stations.
Remember the useful output matrices of PCA, Z and W, that we mentioned in the first section? Here, we are going to interpret them for our MRT case.
Original data, X
- Index: stations
- Columns: hours
- Values: the proportion of passengers entering in the specific hour (#passengers in that hour / #total passengers)
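One possible way to build such an X from the filtered data (a sketch; the groupby keys assume the column names listed in Section 3):

# Total entering passengers per station per hour (stations x hours)
hourly = (df.groupby(['origin', 'hour'])['passenger_count']
            .sum()
            .unstack(fill_value=0))

# Row-normalize so each row sums to 1 (proportion per hour)
X = hourly.div(hourly.sum(axis=1), axis=0)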
With such an X, we can apply PCA with the following code:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

n_components = 3

# Standardize each hourly feature before fitting PCA
X_tran = StandardScaler().fit_transform(X)
pca = PCA(n_components=n_components, whiten=True, random_state=0)
pca.fit(X_tran)
Here, we specify the parameter n_components to be 3, which implies that PCA will extract the 3 most significant components for us.
Note that this is just like “taking several pictures of a 3D object and sorting them from the most representative to the least,” and here we choose the top 3 pictures. So, if we set n_components to 5, we will get 2 more pictures, but our top 3 will remain the same!
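To check how representative those top 3 “pictures” actually are, we can inspect the explained variance ratio, which sklearn already sorts from the most significant component to the least:

# Fraction of the total variance captured by each component, in order
print(pca.explained_variance_ratio_)

# Total variance retained by the 3 components together
print(pca.explained_variance_ratio_.sum())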
PCA output, W matrix
W can be thought of as the weights on each feature (i.e., each hour) with regard to our “pictures”, or more specifically, the principal components.
import pandas as pd

pd.set_option('display.precision', 2)
W = pca.components_
W_df = pd.DataFrame(W, columns=hour_mapper.keys(), index=[f'PC_{i}' for i in range(1, n_components + 1)])
W_df.round(2).style.background_gradient(cmap='Blues')
For our 3 principal components, we can see that PC_1 puts more weight on the night hours, PC_2 on noon, and PC_3 on the morning hours.
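A quick (hypothetical) sanity check of this reading is to list the hours each component weights most heavily:

# For each principal component, show the 3 hours with the largest weights
for pc in W_df.index:
    top_hours = W_df.loc[pc].nlargest(3).index.tolist()
    print(pc, '->', top_hours)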
PCA output, Z matrix
We can interpret the Z matrix as the representations of the stations.
# Transform the standardized data with the PCA model fitted above
Z = pca.transform(X_tran)

# Name the PCs according to the insights from the W matrix
Z_df = pd.DataFrame(Z, index=origin_mapper.keys(), columns=['Night', 'Noon', 'Morning'])

# Look at the stations we demonstrated earlier
Z_df = Z_df.loc[['Zhongxiao_Fuxing', 'Taipei_City_Hall', 'Xinpu', 'Yongan_Market'], :]
Z_df.style.background_gradient(cmap='Blues', axis=1)
In our case, as we have interpreted the W matrix and understood the latent meaning of each component, we can assign names to the PCs accordingly.
The Z matrix for these 4 stations indicates that the first 2 stations have larger proportions of passengers in the night hours, while the other 2 have more in the mornings. This also echoes the findings in our EDA (recall the line chart of these 4 stations in the earlier part).
5. Clustering on the PCA result with K-Means
After getting the PCA result, let's further cluster the transit stations according to their traffic patterns, which are now represented by the 3 principal components.
In the last section, we saw that the Z matrix holds the representations of the stations with regard to night, noon, and morning.
We will cluster the stations based on these representations, such that the stations in the same group would have similar passenger distributions among these 3 periods.
There are a bunch of clustering methods, such as K-Means, DBSCAN, and hierarchical clustering. Since the main topic here is to show the convenience of PCA, we will skip the process of experimenting with which method is the most suitable, and go with K-Means.
from sklearn.cluster import KMeans

# Fit the Z matrix to a K-Means model with 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(Z)
After fitting the K-Means model, let's visualize the clusters with a 3D scatter plot using Plotly.
import plotly.express as px

# Keep the station names as a column for the hover labels
cluster_df = pd.DataFrame(Z, columns=['PC1', 'PC2', 'PC3'],
                          index=origin_mapper.keys()).reset_index()

# Turn the labels from integers to strings,
# such that they are treated as discrete values in the plot
cluster_df['label'] = kmeans.labels_
cluster_df['label'] = cluster_df['label'].astype(str)

fig = px.scatter_3d(cluster_df, x='PC1', y='PC2', z='PC3',
                    color='label',
                    hover_data={"origin": cluster_df['index']},
                    labels={
                        "PC1": "Night",
                        "PC2": "Noon",
                        "PC3": "Morning",
                    },
                    opacity=0.7,
                    size_max=1,
                    width=800, height=500
                    ).update_layout(margin=dict(l=0, r=0, b=0, t=0)
                    ).update_traces(marker_size=5)
fig.show()
6. Insights on the Taipei MRT traffic — Clustering results
- Cluster 0: More passengers in the daytime, so it may be the “living area” group.
- Cluster 2: More passengers in the evening, so it may be the “business area” group.
- Cluster 1: Both day and night hours are full of people entering the stations. It is more complicated to explain the nature of these stations, for there could be various reasons behind different stations. Below, we will take a look at 2 extreme cases in this cluster.
For example, in Cluster 1, the station with the largest number of passengers, Taipei Main Station, is a huge transit hub in Taipei, where commuters can transfer from buses and railway systems to the MRT. Therefore, the high-traffic pattern during both the morning and the evening is clear.
On the contrary, Taipei Zoo station is in Cluster 1 as well, but it is not a case of “both day and night hours are full of people”. Instead, there are not many people in either period, because few residents live around that area, and most citizens seldom visit Taipei Zoo on weekdays.
The patterns of these 2 stations are not much alike, yet they end up in the same cluster. That is, Cluster 1 might contain too many stations that are actually not similar. Thus, in the future, we would have to fine-tune the hyperparameters of K-Means, such as the number of clusters, and methods like the silhouette score and the elbow method would be helpful, as sketched below.
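A hedged sketch of that tuning step, comparing different numbers of clusters by the inertia (for the elbow method) and the silhouette score:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(Z)
    print(k, round(km.inertia_, 2), round(silhouette_score(Z, km.labels_), 3))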