Statistical comparison of data source performance

This section aims to determine whether there are differences in how each data source, i.e., the smartphone (sp), smartwatch (sw) and fused datasets, performs with the selected models for HAR.

Plotly loading issue

This page contains Plotly interactive figures. Sometimes, the figures might not load properly and show a blank image. Reloading the page might solve the loading issue.

Note

As shown in Impact of the amount of training data, the models' results do not follow a normal distribution. Therefore, the following comparisons employ non-parametric tests.
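The tables below report H(2) statistics (a three-group comparison, hence 2 degrees of freedom) and the post-hoc tables report U statistics, which are consistent with Kruskal-Wallis omnibus tests followed by pairwise Mann-Whitney U tests. The following is a minimal sketch of this kind of procedure, assuming those two tests; the actual logic lives in `statistical_comparison`:

Code
# A minimal sketch, assuming the H(2) statistics below come from Kruskal-Wallis
# omnibus tests over the three data sources and the U statistics in the
# post-hoc tables from pairwise Mann-Whitney U tests. The actual procedure is
# implemented by libs.chapter3.analysis.statistical_tests.statistical_comparison.
from scipy.stats import kruskal, mannwhitneyu


def compare_sources(sp_scores, sw_scores, fused_scores, alpha=0.05):
    # Omnibus test: do the three data sources differ for a given model and n?
    h, p = kruskal(sp_scores, sw_scores, fused_scores)
    result = {'H(2)': h, 'p-value': p, 'post-hoc': {}}
    if p < alpha:
        # Pairwise post-hoc comparisons; in practice the p-values should be
        # corrected for multiple comparisons (e.g., Bonferroni or Holm).
        samples = {'sp': sp_scores, 'sw': sw_scores, 'fused': fused_scores}
        for a, b in [('sp', 'sw'), ('sp', 'fused'), ('sw', 'fused')]:
            u, p_pair = mannwhitneyu(samples[a], samples[b], alternative='two-sided')
            result['post-hoc'][(a, b)] = {'U': u, 'p-value': p_pair}
    return result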

Code
import os

from itables import show
from libs.chapter3.analysis.data_loading import load_reports, load_best_significant
from libs.chapter3.analysis.model import Model, Source, ActivityMetric, ModelMetric, TargetFilter, obtain_best_items
from libs.chapter3.analysis.statistical_tests import statistical_comparison
from libs.chapter3.analysis.visualization import plot_visual_comparison, plot_visual_ties


MODELS = [Model.MLP, Model.CNN, Model.LSTM, Model.CNN_LSTM]
SOURCES = [Source.SP, Source.SW, Source.FUSED]
SIGNIFICANCE_FILE = os.path.join('data', 'chapter3', 'significance', 'best_sources.csv')

reports = load_reports()

Overall performance

Regarding the accuracy of the models (Table 7.1), the smartwatch dataset always presents the best performance with small amounts of data (i.e., \(n \in [3, 4]\)), while the fused dataset is the best with high amounts of data across all models. When comparing the smartphone and the smartwatch at the highest amounts of data, the smartphone is superior in the MLP models and the smartwatch in the LSTM models, but no significant differences are observed in the CNN and CNN-LSTM models.

The post-hoc tests can be found in Table 7.2.

Code
datasource_overall_tests, datasource_overall_posthoc = statistical_comparison(
    reports,
    (TargetFilter.MODEL, ModelMetric.ACCURACY), 
    MODELS, 
    SOURCES
)
datasource_overall_tests
Table 7.1: Statistical comparison of overall accuracies obtained by the data sources for each model.
n | mlp: sp, sw, fused, H(2), p-value | cnn: sp, sw, fused, H(2), p-value | lstm: sp, sw, fused, H(2), p-value | cnn-lstm: sp, sw, fused, H(2), p-value
1 0.202 0.528 0.242 147.696 0.000 0.390 0.678 0.511 111.888 0.000 0.283 0.650 0.437 119.849 0.000 0.426 0.624 0.475 67.193 0.000
2 0.491 0.679 0.491 71.194 0.000 0.659 0.755 0.716 33.542 0.000 0.583 0.728 0.661 47.641 0.000 0.682 0.739 0.654 20.235 0.000
3 0.609 0.710 0.624 36.563 0.000 0.719 0.788 0.787 36.940 0.000 0.665 0.776 0.739 55.392 0.000 0.710 0.774 0.767 15.792 0.000
4 0.700 0.736 0.729 10.771 0.005 0.776 0.807 0.821 16.354 0.000 0.694 0.791 0.792 45.746 0.000 0.759 0.801 0.799 9.945 0.007
5 0.743 0.766 0.759 2.758 0.252 0.811 0.829 0.838 15.710 0.000 0.753 0.812 0.796 42.896 0.000 0.800 0.816 0.830 14.262 0.001
6 0.739 0.777 0.788 8.468 0.014 0.798 0.834 0.849 26.631 0.000 0.758 0.829 0.812 61.456 0.000 0.793 0.823 0.839 24.431 0.000
7 0.782 0.778 0.795 3.121 0.210 0.837 0.838 0.852 13.777 0.001 0.771 0.831 0.821 56.416 0.000 0.819 0.823 0.847 16.876 0.000
8 0.789 0.795 0.802 4.698 0.095 0.836 0.840 0.858 15.600 0.000 0.786 0.842 0.832 47.311 0.000 0.821 0.833 0.852 26.122 0.000
9 0.813 0.803 0.819 16.174 0.000 0.843 0.857 0.872 18.848 0.000 0.813 0.846 0.849 22.774 0.000 0.823 0.852 0.869 36.807 0.000
10 0.819 0.807 0.840 24.966 0.000 0.854 0.858 0.875 17.747 0.000 0.826 0.851 0.860 24.270 0.000 0.838 0.849 0.874 35.307 0.000
11 0.825 0.803 0.840 32.831 0.000 0.856 0.854 0.874 21.060 0.000 0.826 0.850 0.861 22.877 0.000 0.838 0.849 0.870 32.273 0.000
12 0.831 0.803 0.847 48.143 0.000 0.855 0.858 0.877 19.258 0.000 0.830 0.852 0.866 24.382 0.000 0.835 0.854 0.883 54.700 0.000
13 0.837 0.810 0.849 64.810 0.000 0.861 0.859 0.887 34.245 0.000 0.846 0.856 0.867 13.828 0.001 0.851 0.856 0.888 66.836 0.000
14 0.841 0.815 0.855 62.088 0.000 0.869 0.863 0.889 36.509 0.000 0.848 0.861 0.875 22.951 0.000 0.854 0.856 0.886 49.737 0.000
15 0.842 0.815 0.861 93.565 0.000 0.869 0.864 0.889 32.564 0.000 0.850 0.863 0.875 16.633 0.000 0.852 0.858 0.890 72.258 0.000
16 0.841 0.814 0.864 88.822 0.000 0.870 0.868 0.890 31.497 0.000 0.850 0.866 0.882 33.638 0.000 0.853 0.860 0.894 90.418 0.000
17 0.851 0.816 0.867 106.636 0.000 0.877 0.867 0.895 36.744 0.000 0.855 0.867 0.878 14.331 0.001 0.856 0.861 0.894 75.250 0.000
18 0.846 0.815 0.867 107.590 0.000 0.877 0.867 0.896 40.226 0.000 0.854 0.869 0.879 17.394 0.000 0.858 0.864 0.894 78.713 0.000
19 0.856 0.821 0.872 123.329 0.000 0.873 0.869 0.896 40.944 0.000 0.855 0.870 0.884 20.879 0.000 0.859 0.868 0.896 87.531 0.000
20 0.852 0.817 0.876 136.192 0.000 0.880 0.872 0.898 51.361 0.000 0.864 0.872 0.886 32.094 0.000 0.859 0.866 0.897 88.692 0.000
21 0.854 0.824 0.873 121.986 0.000 0.884 0.871 0.897 46.619 0.000 0.864 0.871 0.885 15.898 0.000 0.863 0.867 0.900 84.458 0.000
22 0.861 0.822 0.879 144.736 0.000 0.886 0.874 0.902 50.409 0.000 0.865 0.875 0.889 27.551 0.000 0.868 0.869 0.901 89.274 0.000
Code
show(datasource_overall_posthoc, classes="display nowrap compact")
Table 7.2: Post-hoc tests for Table 7.1 to determine the best-performing data sources.
[Interactive table: pairwise post-hoc comparisons indexed by (focus, n), with columns A, B, mean(A), std(A), mean(B), std(B), U, p-value.]
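If the interactive table fails to render, the post-hoc results can also be inspected directly from the returned DataFrame. A hypothetical example, assuming the (focus, n) index shown above with lowercase model names as focus values:

Code
# Hypothetical usage: pairwise post-hoc comparisons for the MLP model
# trained with data from n=22 subjects, assuming a (focus, n) MultiIndex.
datasource_overall_posthoc.loc[('mlp', 22)]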

Activity-wise performance

Next, we focus on how each data source affects the performance of individual activities with each selected model.

SEATED

Results from Table 7.3 and Table 7.4 (post-hoc) show that the smartwatch-trained models obtain the best scores for the SEATED activity with any amount of data and across all the models.

In the case of the CNN and CNN-LSTM models with \(n \geq 10\), the fused-trained models also achieve the best results, with no significant differences from the smartwatch-trained models. Consequently, the smartphone-trained models achieve the worst results with these architectures.

Regarding the MLP and LSTM models, when trained with smartphone or fused data, their performance is significantly worse than when trained with smartwatch data. Between the smartphone- and fused-trained models, no significant differences exist in the MLP model with high amounts of data, although differences are observed in the LSTM model in favour of the fused-trained models.

Code
datasource_seated_tests, datasource_seated_posthoc = statistical_comparison(
    reports,
    (TargetFilter.SEATED, ActivityMetric.F1),
    MODELS, 
    SOURCES
)
datasource_seated_tests
Table 7.3: Statistical comparison of SEATED performance obtained by the data sources for each model.
n | mlp: sp, sw, fused, H(2), p-value | cnn: sp, sw, fused, H(2), p-value | lstm: sp, sw, fused, H(2), p-value | cnn-lstm: sp, sw, fused, H(2), p-value
1 0.065 0.397 0.116 84.149 0.0 0.000 0.772 0.388 216.958 0.0 0.083 0.749 0.316 258.576 0.0 0.084 0.644 0.136 58.067 0.0
2 0.210 0.706 0.456 121.818 0.0 0.273 0.830 0.716 204.815 0.0 0.176 0.823 0.650 248.303 0.0 0.737 0.826 0.705 63.866 0.0
3 0.245 0.805 0.543 185.279 0.0 0.593 0.849 0.800 163.509 0.0 0.431 0.844 0.741 201.328 0.0 0.788 0.850 0.816 35.338 0.0
4 0.498 0.840 0.673 161.477 0.0 0.733 0.864 0.823 110.338 0.0 0.587 0.851 0.800 176.345 0.0 0.800 0.858 0.827 42.377 0.0
5 0.660 0.841 0.685 143.959 0.0 0.783 0.868 0.839 85.974 0.0 0.653 0.856 0.800 148.075 0.0 0.812 0.858 0.853 42.617 0.0
6 0.645 0.850 0.720 152.244 0.0 0.774 0.869 0.839 104.495 0.0 0.669 0.857 0.800 186.511 0.0 0.797 0.875 0.849 78.750 0.0
7 0.727 0.849 0.702 129.278 0.0 0.794 0.866 0.854 68.601 0.0 0.735 0.862 0.828 138.187 0.0 0.821 0.872 0.868 42.119 0.0
8 0.738 0.868 0.717 164.616 0.0 0.810 0.873 0.861 66.231 0.0 0.758 0.867 0.833 121.404 0.0 0.811 0.877 0.870 53.550 0.0
9 0.778 0.871 0.793 117.282 0.0 0.833 0.880 0.861 54.662 0.0 0.797 0.870 0.845 68.948 0.0 0.836 0.879 0.870 45.063 0.0
10 0.767 0.872 0.787 108.284 0.0 0.830 0.877 0.872 58.546 0.0 0.794 0.871 0.844 92.679 0.0 0.833 0.881 0.875 51.054 0.0
11 0.780 0.870 0.798 103.231 0.0 0.827 0.878 0.867 52.735 0.0 0.799 0.866 0.844 79.081 0.0 0.830 0.878 0.865 42.425 0.0
12 0.787 0.876 0.808 109.682 0.0 0.833 0.880 0.876 48.220 0.0 0.811 0.876 0.862 81.305 0.0 0.833 0.882 0.875 49.322 0.0
13 0.807 0.877 0.816 81.070 0.0 0.831 0.884 0.871 60.920 0.0 0.806 0.877 0.847 84.146 0.0 0.831 0.878 0.874 55.169 0.0
14 0.802 0.877 0.813 91.533 0.0 0.842 0.885 0.877 45.794 0.0 0.828 0.872 0.863 49.187 0.0 0.828 0.879 0.874 56.553 0.0
15 0.801 0.873 0.835 78.908 0.0 0.837 0.882 0.872 42.675 0.0 0.830 0.876 0.858 52.279 0.0 0.833 0.882 0.880 60.678 0.0
16 0.809 0.875 0.841 72.361 0.0 0.847 0.885 0.876 36.096 0.0 0.823 0.887 0.857 71.288 0.0 0.843 0.886 0.880 52.916 0.0
17 0.807 0.881 0.833 70.534 0.0 0.848 0.880 0.877 38.414 0.0 0.830 0.876 0.865 53.088 0.0 0.836 0.881 0.880 50.416 0.0
18 0.807 0.882 0.838 76.805 0.0 0.842 0.885 0.875 39.895 0.0 0.837 0.886 0.861 60.194 0.0 0.843 0.882 0.880 55.166 0.0
19 0.812 0.880 0.846 67.328 0.0 0.842 0.889 0.875 51.272 0.0 0.829 0.878 0.865 52.793 0.0 0.837 0.881 0.885 61.821 0.0
20 0.813 0.885 0.842 69.846 0.0 0.851 0.884 0.877 29.107 0.0 0.830 0.883 0.868 59.387 0.0 0.844 0.884 0.883 53.625 0.0
21 0.814 0.875 0.844 57.710 0.0 0.854 0.881 0.875 27.843 0.0 0.842 0.886 0.867 41.392 0.0 0.837 0.883 0.885 67.289 0.0
22 0.826 0.879 0.845 59.162 0.0 0.850 0.887 0.875 34.946 0.0 0.851 0.887 0.867 44.372 0.0 0.842 0.883 0.885 59.259 0.0
Code
show(datasource_seated_posthoc, classes="display nowrap compact")
Table 7.4: Post-hoc tests for Table 7.3 to determine the best-performing data sources.
[Interactive table: pairwise post-hoc comparisons indexed by (focus, n), with columns A, B, mean(A), std(A), mean(B), std(B), U, p-value.]

STANDING_UP

Tables 7.5 and 7.6 show a similar pattern across all models: the smartwatch-trained models produce the best results with low and medium amounts of data, while the fused-trained models are also the best-performing with medium and high amounts of data.

This pattern can be observed in the MLP, CNN and CNN-LSTM models, although the value of \(n\) at which the fused-trained models start to outperform the smartwatch-trained models varies. In the CNN and CNN-LSTM, the models trained with smartwatch data are significantly better than those trained with smartphone data, while the opposite occurs in the MLP model. In the remaining model, the LSTM, no differences exist between the smartwatch- and fused-trained models, so the smartphone-trained models provide the worst results.

Code
datasource_standing_tests, datasource_standing_posthoc = statistical_comparison(
    reports,
    (TargetFilter.STANDING_UP, ActivityMetric.F1),
    MODELS, 
    SOURCES
)
datasource_standing_tests
Table 7.5: Statistical comparison of STANDING_UP performance obtained by the data sources for each model.
n | mlp: sp, sw, fused, H(2), p-value | cnn: sp, sw, fused, H(2), p-value | lstm: sp, sw, fused, H(2), p-value | cnn-lstm: sp, sw, fused, H(2), p-value
1 0.060 0.434 0.141 157.367 0.000 0.130 0.583 0.320 180.914 0.0 0.093 0.546 0.290 204.095 0.0 0.139 0.517 0.275 165.336 0.0
2 0.209 0.555 0.320 113.170 0.000 0.362 0.675 0.563 95.268 0.0 0.272 0.632 0.481 120.194 0.0 0.318 0.656 0.487 114.961 0.0
3 0.324 0.563 0.469 75.872 0.000 0.482 0.716 0.667 81.443 0.0 0.369 0.667 0.550 111.214 0.0 0.494 0.681 0.646 73.127 0.0
4 0.485 0.613 0.594 43.798 0.000 0.609 0.767 0.738 62.602 0.0 0.470 0.717 0.670 91.930 0.0 0.600 0.730 0.722 56.548 0.0
5 0.496 0.634 0.614 31.293 0.000 0.657 0.781 0.749 58.233 0.0 0.541 0.753 0.699 102.212 0.0 0.638 0.760 0.742 58.544 0.0
6 0.514 0.637 0.626 26.012 0.000 0.643 0.800 0.766 77.163 0.0 0.528 0.771 0.701 109.180 0.0 0.626 0.772 0.760 61.746 0.0
7 0.590 0.658 0.667 17.192 0.000 0.693 0.809 0.777 59.785 0.0 0.564 0.768 0.713 99.197 0.0 0.699 0.786 0.777 39.730 0.0
8 0.607 0.674 0.678 13.616 0.001 0.713 0.804 0.793 47.931 0.0 0.642 0.796 0.732 91.323 0.0 0.699 0.788 0.793 51.247 0.0
9 0.651 0.694 0.731 15.759 0.000 0.745 0.823 0.813 51.888 0.0 0.704 0.811 0.777 52.882 0.0 0.726 0.805 0.809 59.026 0.0
10 0.643 0.698 0.732 18.579 0.000 0.769 0.830 0.818 35.898 0.0 0.707 0.824 0.789 70.136 0.0 0.743 0.818 0.822 45.094 0.0
11 0.681 0.696 0.735 11.883 0.003 0.775 0.823 0.827 38.852 0.0 0.724 0.814 0.791 56.135 0.0 0.771 0.814 0.824 39.276 0.0
12 0.675 0.688 0.740 10.138 0.006 0.748 0.830 0.825 50.593 0.0 0.721 0.825 0.786 57.273 0.0 0.756 0.827 0.840 45.765 0.0
13 0.696 0.693 0.766 19.114 0.000 0.775 0.833 0.841 49.029 0.0 0.762 0.832 0.800 48.930 0.0 0.771 0.819 0.847 54.329 0.0
14 0.712 0.697 0.755 17.164 0.000 0.788 0.832 0.838 39.853 0.0 0.765 0.835 0.813 58.304 0.0 0.777 0.826 0.839 39.972 0.0
15 0.721 0.703 0.769 27.619 0.000 0.792 0.845 0.848 42.921 0.0 0.765 0.842 0.821 56.332 0.0 0.769 0.839 0.851 54.770 0.0
16 0.726 0.715 0.779 20.737 0.000 0.800 0.843 0.857 47.573 0.0 0.765 0.841 0.830 58.934 0.0 0.791 0.836 0.861 59.137 0.0
17 0.738 0.712 0.778 27.224 0.000 0.791 0.842 0.857 50.423 0.0 0.780 0.843 0.825 44.072 0.0 0.783 0.832 0.857 51.329 0.0
18 0.724 0.716 0.789 29.998 0.000 0.805 0.843 0.859 36.824 0.0 0.779 0.854 0.825 50.413 0.0 0.786 0.833 0.860 51.852 0.0
19 0.753 0.712 0.794 36.159 0.000 0.809 0.847 0.860 40.092 0.0 0.788 0.846 0.844 43.787 0.0 0.798 0.831 0.870 56.029 0.0
20 0.749 0.714 0.798 44.079 0.000 0.805 0.850 0.869 57.402 0.0 0.789 0.852 0.845 56.008 0.0 0.800 0.838 0.862 45.284 0.0
21 0.753 0.717 0.800 41.653 0.000 0.809 0.847 0.871 44.334 0.0 0.800 0.846 0.847 36.116 0.0 0.802 0.837 0.865 56.834 0.0
22 0.745 0.714 0.800 50.242 0.000 0.821 0.857 0.871 41.424 0.0 0.792 0.851 0.859 47.474 0.0 0.807 0.844 0.872 58.018 0.0
Code
show(datasource_standing_posthoc, classes="display nowrap compact")
Table 7.6: Post-hoc tests for Table 7.5 to determine the best-performing data sources.
[Interactive table: pairwise post-hoc comparisons indexed by (focus, n), with columns A, B, mean(A), std(A), mean(B), std(B), U, p-value.]

WALKING

Tables 7.7 and 7.8 show different patterns depending on the model employed. For the MLP and CNN models, the smartwatch-trained models are the best with small quantities of data, but no significant differences among data sources are observed for \(n \in [4,5]\). From \(n \geq 6\), the smartphone- and fused-trained models obtain the best results.

In the LSTM and CNN-LSTM, the smartwatch-trained models are the best-performing with low amounts of data, while with medium and high quantities of data the fused-trained models are the best. Regarding the models trained with smartphone and smartwatch data, significant differences exist in the CNN-LSTM models in favour of the smartphone-trained models, but not in the LSTM.

Code
datasource_walking_tests, datasource_walking_posthoc = statistical_comparison(
    reports,
    (TargetFilter.WALKING, ActivityMetric.F1),
    MODELS, 
    SOURCES
)
datasource_walking_tests
Table 7.7: Statistical comparison of WALKING performance obtained by the data sources for each model.
n | mlp: sp, sw, fused, H(2), p-value | cnn: sp, sw, fused, H(2), p-value | lstm: sp, sw, fused, H(2), p-value | cnn-lstm: sp, sw, fused, H(2), p-value
1 0.027 0.679 0.071 142.659 0.000 0.471 0.765 0.620 65.950 0.000 0.347 0.734 0.576 91.566 0.000 0.509 0.736 0.603 47.913 0.000
2 0.599 0.778 0.667 48.257 0.000 0.775 0.833 0.811 10.585 0.005 0.728 0.801 0.759 21.695 0.000 0.794 0.810 0.768 7.564 0.023
3 0.764 0.795 0.780 7.010 0.030 0.826 0.852 0.861 12.450 0.002 0.784 0.841 0.803 20.031 0.000 0.820 0.843 0.836 1.628 0.443
4 0.817 0.815 0.813 0.201 0.904 0.859 0.865 0.875 4.437 0.109 0.799 0.860 0.847 20.265 0.000 0.847 0.861 0.853 1.700 0.428
5 0.856 0.835 0.847 3.367 0.186 0.889 0.883 0.888 4.750 0.093 0.838 0.870 0.853 14.631 0.001 0.874 0.867 0.883 9.651 0.008
6 0.845 0.848 0.856 0.333 0.847 0.874 0.886 0.893 7.039 0.030 0.833 0.877 0.860 27.379 0.000 0.874 0.870 0.885 7.491 0.024
7 0.874 0.849 0.876 12.934 0.002 0.899 0.887 0.901 10.127 0.006 0.842 0.883 0.880 30.691 0.000 0.898 0.871 0.894 15.912 0.000
8 0.870 0.855 0.882 16.806 0.000 0.895 0.886 0.904 10.418 0.005 0.848 0.889 0.877 30.343 0.000 0.884 0.879 0.898 15.809 0.000
9 0.893 0.859 0.887 35.791 0.000 0.905 0.900 0.915 14.265 0.001 0.882 0.893 0.890 7.437 0.024 0.887 0.893 0.914 25.310 0.000
10 0.898 0.865 0.899 41.057 0.000 0.910 0.901 0.915 14.139 0.001 0.884 0.894 0.904 10.507 0.005 0.904 0.892 0.913 29.421 0.000
11 0.898 0.868 0.899 49.683 0.000 0.907 0.897 0.912 11.350 0.003 0.879 0.894 0.902 12.538 0.002 0.899 0.892 0.912 20.069 0.000
12 0.905 0.866 0.904 73.540 0.000 0.911 0.901 0.914 17.755 0.000 0.887 0.899 0.904 8.186 0.017 0.896 0.893 0.919 41.693 0.000
13 0.905 0.869 0.908 91.948 0.000 0.917 0.901 0.924 36.930 0.000 0.900 0.898 0.907 5.347 0.069 0.906 0.896 0.925 62.834 0.000
14 0.909 0.870 0.908 85.674 0.000 0.916 0.907 0.925 28.561 0.000 0.903 0.906 0.917 10.336 0.006 0.910 0.898 0.925 52.277 0.000
15 0.909 0.874 0.915 109.023 0.000 0.922 0.908 0.925 37.932 0.000 0.902 0.905 0.914 5.706 0.058 0.908 0.900 0.927 60.843 0.000
16 0.910 0.869 0.914 100.301 0.000 0.925 0.909 0.925 38.392 0.000 0.904 0.905 0.919 18.383 0.000 0.909 0.899 0.930 83.681 0.000
17 0.914 0.871 0.918 121.710 0.000 0.925 0.908 0.928 37.226 0.000 0.903 0.909 0.916 4.562 0.102 0.910 0.902 0.928 66.523 0.000
18 0.913 0.873 0.916 128.281 0.000 0.928 0.908 0.928 48.604 0.000 0.909 0.906 0.918 10.981 0.004 0.914 0.902 0.933 82.070 0.000
19 0.916 0.875 0.918 148.882 0.000 0.927 0.908 0.929 51.017 0.000 0.905 0.906 0.919 13.093 0.001 0.918 0.904 0.932 84.062 0.000
20 0.916 0.874 0.922 145.974 0.000 0.928 0.912 0.930 48.167 0.000 0.912 0.910 0.925 17.917 0.000 0.917 0.906 0.932 79.506 0.000
21 0.917 0.880 0.923 134.989 0.000 0.931 0.913 0.931 62.152 0.000 0.915 0.913 0.925 9.854 0.007 0.918 0.905 0.933 75.168 0.000
22 0.918 0.877 0.923 169.960 0.000 0.932 0.913 0.933 69.003 0.000 0.912 0.911 0.925 19.418 0.000 0.921 0.907 0.935 88.114 0.000
Code
show(datasource_walking_posthoc, classes="display nowrap compact")
Table 7.8: Post-hoc tests for Table 7.7 to determine the best-performing data sources.
[Interactive table: pairwise post-hoc comparisons indexed by (focus, n), with columns A, B, mean(A), std(A), mean(B), std(B), U, p-value.]

TURNING

The results presented in Tables 7.9 and 7.10 indicate that the smartphone- and fused-trained models obtain the best metrics in almost every case.

The smartphone-trained models consistently obtain the best results with any amount of data across all models. On the other hand, the fused-trained models require medium amounts of data to equal the smartphone results in the MLP and CNN-LSTM. The smartwatch-trained models only perform well when the minimum amount of training data is used; for any other quantity, they provide the worst results.

Code
datasource_turning_tests, datasource_turning_posthoc = statistical_comparison(
    reports,
    (TargetFilter.TURNING, ActivityMetric.F1),
    MODELS, 
    SOURCES
)
datasource_turning_tests
Table 7.9: Statistical comparison of TURNING performance obtained by the data sources for each model.
n | mlp: sp, sw, fused, H(2), p-value | cnn: sp, sw, fused, H(2), p-value | lstm: sp, sw, fused, H(2), p-value | cnn-lstm: sp, sw, fused, H(2), p-value
1 0.151 0.412 0.344 23.926 0.000 0.589 0.648 0.534 18.943 0.000 0.402 0.514 0.478 11.056 0.004 0.568 0.574 0.497 7.935 0.019
2 0.636 0.614 0.451 22.429 0.000 0.779 0.692 0.761 12.047 0.002 0.744 0.642 0.768 31.916 0.000 0.763 0.663 0.682 21.681 0.000
3 0.733 0.674 0.648 14.444 0.001 0.796 0.728 0.796 29.461 0.000 0.784 0.715 0.783 39.760 0.000 0.773 0.709 0.759 22.098 0.000
4 0.783 0.670 0.738 32.274 0.000 0.820 0.742 0.810 39.893 0.000 0.813 0.728 0.818 56.020 0.000 0.818 0.727 0.785 39.105 0.000
5 0.829 0.701 0.784 71.408 0.000 0.826 0.753 0.827 51.768 0.000 0.811 0.746 0.835 91.561 0.000 0.839 0.741 0.809 77.591 0.000
6 0.821 0.714 0.797 55.737 0.000 0.827 0.766 0.823 32.806 0.000 0.812 0.761 0.827 44.326 0.000 0.827 0.757 0.822 43.343 0.000
7 0.840 0.716 0.828 100.433 0.000 0.842 0.773 0.837 55.313 0.000 0.825 0.763 0.835 56.482 0.000 0.844 0.758 0.830 61.091 0.000
8 0.836 0.723 0.834 148.564 0.000 0.849 0.772 0.839 68.741 0.000 0.832 0.780 0.841 51.793 0.000 0.846 0.759 0.832 70.827 0.000
9 0.847 0.728 0.836 170.804 0.000 0.843 0.786 0.849 66.251 0.000 0.838 0.773 0.845 79.757 0.000 0.839 0.781 0.840 63.635 0.000
10 0.854 0.734 0.846 173.315 0.000 0.854 0.793 0.849 70.227 0.000 0.851 0.786 0.851 67.842 0.000 0.844 0.786 0.847 65.421 0.000
11 0.845 0.728 0.850 191.693 0.000 0.852 0.784 0.849 76.988 0.000 0.845 0.784 0.850 78.618 0.000 0.846 0.782 0.839 51.429 0.000
12 0.853 0.726 0.859 217.040 0.000 0.859 0.794 0.849 79.960 0.000 0.848 0.786 0.855 61.309 0.000 0.848 0.790 0.854 73.268 0.000
13 0.856 0.736 0.859 226.242 0.000 0.864 0.799 0.860 104.813 0.000 0.852 0.788 0.851 79.701 0.000 0.860 0.785 0.859 82.505 0.000
14 0.856 0.743 0.862 229.817 0.000 0.860 0.797 0.862 96.266 0.000 0.850 0.802 0.855 79.526 0.000 0.856 0.791 0.855 78.966 0.000
15 0.861 0.745 0.865 274.935 0.000 0.867 0.802 0.860 108.313 0.000 0.854 0.799 0.860 78.817 0.000 0.857 0.795 0.856 77.251 0.000
16 0.862 0.738 0.863 261.219 0.000 0.867 0.802 0.862 95.275 0.000 0.860 0.807 0.860 65.828 0.000 0.856 0.794 0.857 87.502 0.000
17 0.866 0.735 0.868 283.156 0.000 0.871 0.801 0.862 112.284 0.000 0.855 0.806 0.859 61.410 0.000 0.860 0.806 0.860 87.450 0.000
18 0.864 0.743 0.862 272.812 0.000 0.866 0.802 0.867 97.861 0.000 0.858 0.805 0.860 66.906 0.000 0.858 0.800 0.862 80.067 0.000
19 0.866 0.742 0.864 296.156 0.000 0.869 0.807 0.869 105.937 0.000 0.853 0.812 0.859 65.296 0.000 0.860 0.805 0.863 87.774 0.000
20 0.864 0.742 0.868 312.355 0.000 0.870 0.812 0.868 112.034 0.000 0.859 0.815 0.860 62.550 0.000 0.864 0.799 0.866 95.319 0.000
21 0.865 0.750 0.870 312.882 0.000 0.874 0.819 0.869 111.421 0.000 0.860 0.812 0.859 69.801 0.000 0.862 0.803 0.864 84.037 0.000
22 0.864 0.743 0.868 305.136 0.000 0.870 0.812 0.870 111.311 0.000 0.861 0.817 0.861 67.998 0.000 0.865 0.807 0.866 87.498 0.000
Code
show(datasource_turning_posthoc, classes="display nowrap compact")
Table 7.10: Post-hoc tests for Table 7.9 to determine the best-performing data sources.
[Interactive table: pairwise post-hoc comparisons indexed by (focus, n), with columns A, B, mean(A), std(A), mean(B), std(B), U, p-value.]

SITTING_DOWN

The results shown in Tables 7.11 and 7.12 present patterns similar to those of the STANDING_UP activity. On the one hand, in the MLP, CNN and CNN-LSTM, the smartwatch-trained models are the best-performing with low and medium amounts of data, while the fused-trained models provide the best metrics with medium and high amounts of data. On the other hand, the models trained with smartwatch and fused data are the best in the LSTM models.

As in the STANDING_UP activity, aside from the superiority of the fused-trained models, the smartwatch-trained models outperform the smartphone-trained models in the CNN and CNN-LSTM, while the opposite applies to the MLP models.

Code
datasource_sitting_tests, datasource_sitting_posthoc = statistical_comparison(
    reports,
    (TargetFilter.SITTING_DOWN, ActivityMetric.F1),
    MODELS, 
    SOURCES
)
datasource_sitting_tests
Table 7.11: Statistical comparison of SITTING_DOWN performance obtained by the data sources for each model.
n | mlp: sp, sw, fused, H(2), p-value | cnn: sp, sw, fused, H(2), p-value | lstm: sp, sw, fused, H(2), p-value | cnn-lstm: sp, sw, fused, H(2), p-value
1 0.108 0.368 0.100 125.001 0.0 0.102 0.583 0.238 185.743 0.0 0.094 0.499 0.252 190.167 0.0 0.152 0.479 0.168 164.098 0.0
2 0.207 0.458 0.346 62.538 0.0 0.363 0.657 0.581 83.232 0.0 0.305 0.649 0.511 108.515 0.0 0.265 0.624 0.478 80.936 0.0
3 0.286 0.535 0.443 94.438 0.0 0.473 0.725 0.671 82.024 0.0 0.406 0.691 0.610 123.758 0.0 0.471 0.717 0.620 91.994 0.0
4 0.420 0.603 0.569 59.284 0.0 0.619 0.754 0.712 43.731 0.0 0.502 0.718 0.680 92.819 0.0 0.572 0.730 0.709 60.015 0.0
5 0.507 0.632 0.612 37.638 0.0 0.651 0.768 0.753 53.262 0.0 0.547 0.750 0.654 90.598 0.0 0.613 0.748 0.754 62.946 0.0
6 0.518 0.633 0.642 42.385 0.0 0.650 0.786 0.779 53.718 0.0 0.617 0.766 0.711 87.575 0.0 0.632 0.766 0.780 73.800 0.0
7 0.594 0.659 0.658 24.641 0.0 0.704 0.793 0.800 46.566 0.0 0.597 0.778 0.738 85.029 0.0 0.680 0.778 0.812 64.448 0.0
8 0.612 0.667 0.679 25.609 0.0 0.705 0.806 0.811 47.750 0.0 0.634 0.794 0.754 76.895 0.0 0.693 0.786 0.807 67.641 0.0
9 0.659 0.688 0.698 15.948 0.0 0.738 0.822 0.827 42.565 0.0 0.689 0.808 0.785 47.875 0.0 0.692 0.806 0.815 68.162 0.0
10 0.660 0.695 0.731 25.724 0.0 0.751 0.829 0.841 55.646 0.0 0.736 0.815 0.792 45.626 0.0 0.724 0.802 0.828 56.691 0.0
11 0.669 0.695 0.728 19.301 0.0 0.766 0.828 0.828 60.149 0.0 0.739 0.809 0.786 52.023 0.0 0.727 0.811 0.828 78.245 0.0
12 0.687 0.697 0.743 23.797 0.0 0.765 0.824 0.843 46.111 0.0 0.743 0.818 0.814 41.332 0.0 0.736 0.805 0.842 71.700 0.0
13 0.710 0.701 0.765 26.949 0.0 0.774 0.827 0.850 51.775 0.0 0.753 0.817 0.809 30.991 0.0 0.756 0.804 0.856 82.744 0.0
14 0.717 0.706 0.762 23.874 0.0 0.784 0.832 0.860 50.096 0.0 0.765 0.823 0.831 32.109 0.0 0.753 0.811 0.849 83.008 0.0
15 0.713 0.714 0.783 45.473 0.0 0.787 0.834 0.857 48.778 0.0 0.778 0.828 0.824 22.071 0.0 0.761 0.814 0.857 85.721 0.0
16 0.714 0.704 0.785 57.454 0.0 0.791 0.845 0.848 38.998 0.0 0.783 0.831 0.838 35.341 0.0 0.759 0.823 0.865 104.325 0.0
17 0.734 0.720 0.786 48.910 0.0 0.809 0.837 0.857 41.412 0.0 0.796 0.833 0.832 17.900 0.0 0.774 0.822 0.860 71.749 0.0
18 0.739 0.715 0.785 47.594 0.0 0.800 0.846 0.869 59.795 0.0 0.795 0.833 0.837 21.110 0.0 0.768 0.824 0.866 87.083 0.0
19 0.743 0.722 0.800 62.711 0.0 0.806 0.845 0.873 54.034 0.0 0.786 0.844 0.839 24.038 0.0 0.769 0.823 0.870 93.933 0.0
20 0.750 0.714 0.802 72.130 0.0 0.807 0.838 0.873 58.636 0.0 0.796 0.833 0.850 31.334 0.0 0.783 0.822 0.876 98.424 0.0
21 0.744 0.730 0.813 61.236 0.0 0.816 0.840 0.872 47.147 0.0 0.806 0.841 0.840 16.009 0.0 0.788 0.822 0.871 92.640 0.0
22 0.768 0.726 0.810 87.104 0.0 0.821 0.850 0.879 51.039 0.0 0.800 0.842 0.846 23.175 0.0 0.790 0.829 0.876 91.725 0.0
Code
show(datasource_sitting_posthoc, classes="display nowrap compact")
Table 7.12: Post-hoc tests for Table 7.11 to determine the best-performing data sources.
[Interactive table: pairwise post-hoc comparisons indexed by (focus, n), with columns A, B, mean(A), std(A), mean(B), std(B), U, p-value.]

Summary

The obtained results show that, with low amounts of data, the smartwatch-trained models have the best overall accuracy and per-activity F1-scores across all models. On the other hand, models trained with the fused dataset present the best overall accuracy with higher amounts of data.

When focusing on individual activities, the smartwatch-trained models are the best at recognizing the SEATED activity. The smartwatch-trained models are also the best for the STANDING_UP activity with low and medium amounts of data, while with higher amounts the fused-trained models show better results. In the WALKING activity, the fused-trained models obtain the best results across all models, while the smartphone-trained models join them in the MLP and CNN models. Similarly, the models trained with the smartphone and fused datasets are the best in the TURNING activity. For the SITTING_DOWN activity, the smartwatch-trained models perform well with low quantities of data, while the fused-trained models are the best with medium and higher quantities. It is worth noting that the patterns observed in the STANDING_UP and SITTING_DOWN activities are very similar, which can be explained by the inverse nature of these movements.

While the models trained with the fused dataset usually show the best results, sometimes they are not statistically better than the results obtained with the other datasets. In other words, when the best results are achieved both by the fused-trained models and by the smartphone- or smartwatch-trained ones, fusing the data is not worth the effort. For example, in the TURNING activity, the smartphone- and fused-trained models are always the best, which indicates that fusing smartphone and smartwatch data does not improve on the smartphone results. The same occurs in the WALKING activity with the MLP and CNN models. However, the fusion pays off for the remaining models in that activity, and in the STANDING_UP and SITTING_DOWN activities.

These results are visually summarized and simplified in Figures 7.1 and 7.2. Some examples of how to interpret the figure follow: in the WALKING activity with the CNN model, for \(n=1\), the statistically best metrics are obtained with the smartwatch dataset; for \(n=2\), the best metrics are obtained with the smartwatch dataset, although they are not statistically better than those of another dataset (whether smartphone or fused can be determined by checking Table 7.7); and for \(n=4\), no significant difference is observed between data sources.
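This reading rule can be summarized as a small decision function. The following is a hypothetical sketch (the actual figure is produced by `plot_visual_comparison`), where `means` and `posthoc_p` stand for the per-source mean scores and pairwise post-hoc p-values of one activity, model and \(n\) combination:

Code
# Hypothetical sketch of the marking rule used when reading Figure 7.1:
# the best-scoring source is always marked, and any other source is marked
# too when the post-hoc test finds no significant difference with the best.
def cell_symbols(means, posthoc_p, alpha=0.05):
    # means: e.g., {'sp': 0.775, 'sw': 0.833, 'fused': 0.811} (WALKING, CNN, n=2)
    # posthoc_p: pairwise p-values keyed by sorted source pairs
    best = max(means, key=means.get)
    marked = {best}
    for other in means:
        if other != best and posthoc_p.get(tuple(sorted((best, other))), 1.0) >= alpha:
            marked.add(other)  # tie: not significantly different from the best
    return marked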

Given the obtained results, it is not possible to determine a clear winner. In the end, the most suitable data source will depend on the amount of data that can be collected, the target activities and the selected model – in line with . We could conclude that the fused dataset would be the best option with any model and a moderate amount of data, while the smartwatch dataset would be good for the SEATED activity and the smartphone dataset works well with the TURNING activity. The smartwatch dataset would also be the preferred choice for the STANDING_UP and SITTING_DOWN activities using the LSTM model.

Code
sources_results = {
    TargetFilter.MODEL: datasource_overall_tests, 
    TargetFilter.SEATED: datasource_seated_tests, 
    TargetFilter.STANDING_UP: datasource_standing_tests, 
    TargetFilter.WALKING: datasource_walking_tests, 
    TargetFilter.TURNING: datasource_turning_tests, 
    TargetFilter.SITTING_DOWN: datasource_sitting_tests
}

best_sources = obtain_best_items(sources_results, SOURCES, MODELS)
significance_sources = load_best_significant(SIGNIFICANCE_FILE)

plot_visual_comparison(best_sources, significance_sources, SOURCES, MODELS)
Figure 7.1: Graphical representation of the best data sources for each metric, model and amount-of-data combination. Symbols: ■ (SP), ◆ (SW) and ● (FUSED).
Code
plot_visual_ties(best_sources, significance_sources, SOURCES, MODELS)
Figure 7.2: Visual representation of performance ties. The plot indicates the number of times each data source tied with another one.

Code reference