Discovering action insights from large-scale assessment log data using machine learning

The feature selection and validation process is carried out with two problem sets from the PIAAC problem-solving tests: Party Invitation and Club Membership.

Party Invitation

“Party Invitation” is a problem-solving set in which the task is to place emails in folders according to the PAR. The task requires reading comprehension and familiarity with the work environment.

Selecting Features – Word2Vec Results

The vectorized actions for participants in the score 0 group and the score 3 group are shown in Figures 3 and 4, respectively. We focus on the score 0 and score 3 groups because the difference in action sequences, and hence the variation in vector space, is greatest between them. To compare the two groups, we divided the vector space for the score 0 group into three clusters. Cluster 1, which consists of the three mail move actions (MAIL_DROP, MAIL_MOVED, and MAIL_DRAG), shows a noticeable difference: unlike in the score 3 vector space, the mail move actions are positioned far from the other actions in the score 0 vector space. Cluster 2 reveals differences in both its component actions and its structure. In the score 0 vector space, cluster 2 is tightly grouped without subclusters, whereas in the score 3 vector space it is more dispersed, with potential subclusters. Furthermore, unlike in the score 0 group, the mail move actions are placed among the other actions in the score 3 vector space. Finally, cluster 3 does not differ significantly between the two vector spaces, as its actions form a cluster of their own in both cases.

As highlighted in Figure 4, the mail move actions, which show the largest differences between the score 0 and score 3 groups, were selected as potentially meaningful actions for distinguishing between these groups. One example of a score 3 group action sequence is [START, MAIL_DRAG, MAIL_VIEWED, FOLDER_VIEWED, MAIL_DROP, MAIL_MOVED, MAIL_DRAG, FOLDER_VIEWED, MAIL_DROP, MAIL_MOVED, MAIL_VIEWED, MAIL_VIEWED, MAIL_VIEWED, MAIL_DRAG, MAIL_VIEWED, FOLDER_VIEWED, MAIL_DROP, MAIL_MOVED, MAIL_VIEWED, MAIL_VIEWED, NEXT_INQUIRY, NEXT_BUTTON, CONFIRMATION_OPENED, BUTTON, DOACTION, BUTTON, CONFIRMATION_CLOSED, NEXT_ITEM]. This shows the potentially meaningful positions of these actions and the variety of contextual actions that surround them.
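To make the idea of "contextual actions" concrete, the following sketch counts which actions fall within a small window around each mail move action in the example sequence above. It is not Word2Vec itself, only an illustration of the kind of context window such a model learns from. The window size of 2 and the helper name context_counts are assumptions made for illustration, not values taken from the study.

```python
from collections import Counter

# Example score 3 action sequence quoted in the text.
SEQUENCE = [
    "START", "MAIL_DRAG", "MAIL_VIEWED", "FOLDER_VIEWED", "MAIL_DROP",
    "MAIL_MOVED", "MAIL_DRAG", "FOLDER_VIEWED", "MAIL_DROP", "MAIL_MOVED",
    "MAIL_VIEWED", "MAIL_VIEWED", "MAIL_VIEWED", "MAIL_DRAG", "MAIL_VIEWED",
    "FOLDER_VIEWED", "MAIL_DROP", "MAIL_MOVED", "MAIL_VIEWED", "MAIL_VIEWED",
    "NEXT_INQUIRY", "NEXT_BUTTON", "CONFIRMATION_OPENED", "BUTTON",
    "DOACTION", "BUTTON", "CONFIRMATION_CLOSED", "NEXT_ITEM",
]

# The three mail move actions named in the text.
MAIL_MOVE = {"MAIL_DRAG", "MAIL_DROP", "MAIL_MOVED"}


def context_counts(sequence, targets, window=2):
    """Count the actions found within `window` positions of each target action.

    `window=2` is an illustrative assumption, not the study's setting.
    """
    counts = Counter()
    for i, action in enumerate(sequence):
        if action not in targets:
            continue
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[sequence[j]] += 1
    return counts


contexts = context_counts(SEQUENCE, MAIL_MOVE, window=2)
for action, n in contexts.most_common():
    print(action, n)
```

Running the same count on sequences from each score group would give a rough, frequency-based view of how varied the neighbourhood of the mail move actions is, which is the property the embedding comparison examines.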

Feature validation – Doc2Vec, NN and RF results

We therefore created three cases to assess the importance of mail move actions in distinguishing between the score 0 and score 3 groups: the original sequence (Case 1), a subsequence consisting only of mail move actions (Case 2), and a subsequence consisting of all actions except the mail move actions (Case 3). Every case except Case 1 thus involves generating a subsequence. Next, we used Doc2Vec to generate vectors from these sequences. The silhouette scores for each case are shown in Table 6, and Figure 5 visualizes the Doc2Vec vectors of the action sequences. In Figure 5, purple dots show sequences by participants in the score 0 group, while yellow dots show sequences by participants in the score 3 group. In Case 1, which includes all 35 actions, some degree of separation can be observed visually in Figure 5a, as reflected in a silhouette score of 0.36. Case 2, which focuses solely on mail move actions, shows stronger separation, reflected in a silhouette score of 0.491 (Figure 5b). Case 3, which excludes mail move actions, shows weaker separation between the two groups than Case 1 and Case 2, as indicated by its lower silhouette score of 0.216 (Figure 5c).
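The construction of the three cases can be sketched as a simple filter over each action sequence. This is a minimal illustration under stated assumptions: the function name build_cases and the shortened example sequence are hypothetical, and the study's actual preprocessing code is not shown in the text.

```python
# The three mail move actions named in the text.
MAIL_MOVE = {"MAIL_DRAG", "MAIL_DROP", "MAIL_MOVED"}


def build_cases(sequence, key_actions):
    """Return the three sequence variants described for feature validation."""
    return {
        # Case 1: the original sequence, unchanged.
        "case1": list(sequence),
        # Case 2: only the key (mail move) actions, in their original order.
        "case2": [a for a in sequence if a in key_actions],
        # Case 3: every action except the key actions.
        "case3": [a for a in sequence if a not in key_actions],
    }


# Shortened, illustrative sequence (not a full participant log).
example = [
    "START", "MAIL_DRAG", "MAIL_VIEWED", "FOLDER_VIEWED",
    "MAIL_DROP", "MAIL_MOVED", "NEXT_ITEM",
]

cases = build_cases(example, MAIL_MOVE)
for name, seq in cases.items():
    print(name, seq)
```

Each variant would then be vectorized separately, so that any change in group separation can be attributed to the presence or absence of the key actions.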

Table 6: Silhouette scores of Doc2Vec vectorization for the three cases.

The silhouette scores may appear relatively low, but this can reasonably be attributed to two factors: the distortion introduced by dimensionality reduction and variation in intracluster density. Dimensionality reduction methods such as t-SNE and PCA are commonly used for visualization, but they often distort the original high-dimensional distance relationships, leading to inconsistencies between visual interpretation and quantitative clustering metrics. Variation in intracluster density refers to the situation in which some clusters are tightly packed while others are more dispersed. This heterogeneity raises the average intracluster distance and thereby lowers the silhouette score, even when the clusters are visually well separated.
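The effect of intracluster density on the silhouette score can be illustrated with a small pure-Python sketch. It uses the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its cluster and b is the smallest mean distance to any other cluster. The toy coordinates are invented for illustration, and Euclidean distance is an assumption; the text does not state which distance metric or library the study used.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: cluster A is tight in both scenarios.
cluster_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
tight_b = [(10, 0), (10, 1), (11, 0), (11, 1)]    # tight second cluster
loose_b = [(7, -3), (7, 4), (14, -3), (14, 4)]    # same centre, more dispersed
labels = [0] * 4 + [1] * 4

tight_score = silhouette_score(cluster_a + tight_b, labels)
loose_score = silhouette_score(cluster_a + loose_b, labels)
print(f"both clusters tight: {tight_score:.3f}")
print(f"one cluster dispersed: {loose_score:.3f}")
```

The two scenarios share the same cluster centres, so the drop in the score for the second one comes entirely from the dispersion of one cluster.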

Furthermore, the inherently noisy and overlapping nature of behavioral log data makes it difficult to obtain high silhouette scores in unsupervised clustering settings. Nevertheless, of the three experimental cases, Case 2 shows the clearest cluster separation in the 2D projections and, in relative terms, the most coherent and interpretable clustering structure. These findings provide empirical support for the validity of the configuration used in Case 2.

The results of the neural network (NN) using the Doc2Vec vectors as input are shown in Table 7, with F1 and accuracy both ranging from 0.850 to 0.962. Consistent with the Doc2Vec visualizations and silhouette scores, Case 2, which shows better separation than Case 1, also achieves higher predictive performance than Case 1. Conversely, Case 3 has the lowest predictive performance, below 0.9 in both F1 and accuracy. The results for the random forest (RF) were not significantly different from those for the NN.
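For reference, the two reported metrics can be computed from predicted and true group labels as in the sketch below. The labels are invented for illustration, and treating the score 3 group as the positive class (label 1) is an assumption; the text does not state how the classes were encoded.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = score 3 group, 0 = score 0 group (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, F1={f1:.3f}")
```

Reporting both metrics is useful when the two groups differ in size, since accuracy alone can look high for a classifier that favours the larger group.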

Table 7. F1 score and accuracy of the neural network and random forest methods on Doc2Vec vectors for the three cases.

In the “Party Invitation” problem set, the mail move actions were crucial for organizing emails into folders: participants who scored 3 performed all three such actions, whereas those who scored 0 performed few or none. The analysis reveals differences in the frequency, sequence placement, and contextual relationships of these actions, reflecting differences in participants' skills and strategies. With Word2Vec, the mail move actions of the score 3 group showed 10 adjacent actions and strong contextual coherence, forming tight clusters, while those of the score 0 group were dispersed. Feature validation reinforced the importance of the mail move actions: Case 2 (only these actions) achieved the highest silhouette score (0.491) and classification accuracy (94.6%), while Case 3 (all actions except these) showed degraded performance. These findings highlight Word2Vec's ability to capture non-contiguous relationships, underline the role of mail move actions in distinguishing performance levels, and contribute to improved data consistency and predictive accuracy.

Club Membership

In the “Club Membership” problem set, participants are tasked with finding the membership ID number for a specific name and sending it by email. This task evaluates familiarity with both lookup and email systems.

Selecting Features – Word2Vec Results

The vectorized actions of the correct and incorrect groups are shown in Figures 6 and 7, respectively. As with the “Party Invitation” problem set, the vector space of the incorrect group is split into three clusters for comparison. Cluster 1 shows noticeable differences between the correct and incorrect groups in terms of sort actions (SS_SORT, COMBOBOX, RADIO_BTN), email-send actions (TEXTBOX_KILLFOCUS and TEXTBOX_ONFOCUS), and environment actions (ENVIRONMENT and TOOLBAR). In the incorrect group, the email-send actions belong to cluster 1, whereas in the correct group they join the environment actions (cluster 2 actions) to form a different, contiguous cluster. Cluster 3 shows no discernible difference between the two vector spaces, as its actions form a cluster of their own in both cases.

Sort actions were excluded from feature validation because they were rare in both groups, occurring in only 204 out of 934 cases in the correct group and 18 out of 410 in the incorrect group. At these frequencies, sort actions were judged insufficient for comparing the correct and incorrect groups. Therefore, the actions in cluster 2, comprising the email-send and environment actions highlighted in Figure 7, were chosen as potentially meaningful actions for distinguishing between the correct and incorrect groups. One example action sequence is [START, TOOLBAR, ENVIRONMENT, DOACTION, DOACTION, DOACTION, DOACTION, TOOLBAR, ENVIRONMENT, DOACTION, DOACTION, DOACTION, DOACTION, TEXTBOX_ONFOCUS, KEYPRESS, KEYPRESS, KEYPRESS, KEYPRESS, TEXTBOX_KILLFOCUS, NEXT_INQUIRY, NEXT_BUTTON, CONFIRMATION_OPENED, BUTTON, DOACTION, TEXTBOX_ONFOCUS, TEXTBOX_KILLFOCUS, BUTTON, CONFIRMATION_CLOSED, NEXT_ITEM, END]. This shows the potentially meaningful positions of these actions and the variety of contextual actions that surround them.
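The reported counts can be turned into rates with a short calculation. This is only worked arithmetic on the figures quoted above. The text does not say whether the counts refer to participants or to sequences, so the sketch treats them simply as occurrences out of a group total.

```python
# Counts quoted in the text: (occurrences of sort actions, group total).
sort_counts = {
    "correct": (204, 934),
    "incorrect": (18, 410),
}

# Share of each group in which sort actions occurred.
rates = {group: hits / total for group, (hits, total) in sort_counts.items()}

for group, rate in rates.items():
    print(f"{group}: {rate:.1%}")
```

This gives roughly 21.8% for the correct group and 4.4% for the incorrect group, so in the incorrect group in particular there are very few observations on which a comparison could be based.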

Feature validation – Doc2Vec, NN and RF results

As with the “Party Invitation” problem set, three cases were created, each of which, except Case 1, involves generating a subsequence. Case 1 covers all 25 actions, Case 2 focuses solely on the email-send and environment actions, and Case 3 includes the remaining 21 actions, excluding the email-send and environment actions. The silhouette scores for each case are shown in Table 8, and Figure 8 visualizes the Doc2Vec vectors of the action sequences. In Figure 8, purple dots show sequences by participants in the correct group, while yellow dots show sequences by participants in the incorrect group. The differences between the two groups were more pronounced in Case 2, which had a higher silhouette score than Case 1, as shown in Figures 8a and 8b. In contrast, the sequences in Case 3 contain no email-send actions, which reduced the degree of separation between the two groups, as reflected in a silhouette score of 0.129 and visualized in Figure 8c.

Table 8: Silhouette scores of Doc2Vec vectorization for the three cases.

Here too, the silhouette scores are not very high. As explained for the “Party Invitation” problem set, this may be a limitation arising from the inherent characteristics of the data and from variation in intracluster density, which is large here, as the 2D plots show. Nevertheless, among the three cases, the score for Case 2 is notably higher than those for the other two. This suggests that the selected actions are more meaningful than the other actions.

For all three cases, the NN's F1 score and accuracy range from 0.862 to 0.917 (Table 9). In particular, when only the email-send and environment actions are considered in Case 2, performance is as good as that of Case 1, with a maximum F1 score of 0.917 and an accuracy of 0.876. Case 3, which omits the meaningful actions from the sequence, shows slightly lower F1 score and accuracy. As with the “Party Invitation” problem set, the RF results are closely consistent with the NN results.

Table 9. F1 score and accuracy of the neural network and random forest methods on Doc2Vec vectors for the three cases.

In the “Club Membership” problem set, the email-send actions were essential to completing the task, as participants had to send membership information by email. Participants in the correct group performed these actions consistently at the end of the sequence, in line with the task requirements, whereas participants in the incorrect group performed them frequently and at inconsistent points in the sequence. Analysis using Word2Vec revealed that participants in the correct group showed a stronger contextual relationship between the email-send actions and adjacent actions, forming cohesive clusters. Feature validation further highlighted the importance of these actions, with Case 2 (only the meaningful actions) achieving better silhouette scores and NN predictive performance than Case 1 (all actions) and Case 3 (all actions except the meaningful ones). The improvements were less pronounced than in the “Party Invitation” set, but focusing on only four key actions achieved comparable prediction accuracy, underscoring their importance. These findings suggest that trained NN models can be generalized to new datasets, allowing score prediction and comparison across a variety of features such as completion times.
