Statistical comparison of model performance

This section determines whether there are statistically significant differences in the performance of the selected models, i.e., MLP, CNN, LSTM and CNN-LSTM, with each dataset.

Plotly loading issue

This page contains Plotly interactive figures. Sometimes, a figure might not load properly and show a blank image; reloading the page usually solves the issue.

Note

As shown in Impact of the amount of training data, the models' results do not follow a normal distribution. Therefore, the following comparisons employ non-parametric tests.
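As a hedged illustration of such a non-parametric comparison (the actual tests are performed by `statistical_comparison` from `libs.chapter3`; the data below are synthetic and the layout is an assumption), a Kruskal-Wallis H-test over the accuracy samples of four models could look like this:

```python
# Illustrative sketch, NOT the chapter's pipeline: Kruskal-Wallis H-test
# comparing accuracy samples of four hypothetical models. The test makes
# no normality assumption, which is why it suits these results.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Hypothetical accuracy samples per model (e.g., one value per evaluation run).
accuracies = {
    'mlp': rng.normal(0.82, 0.02, 30),
    'cnn': rng.normal(0.86, 0.02, 30),
    'lstm': rng.normal(0.83, 0.02, 30),
    'cnn-lstm': rng.normal(0.85, 0.02, 30),
}

# H(3): H statistic with 3 degrees of freedom (4 groups - 1), as in the tables.
h, p = kruskal(*accuracies.values())
print(f'H(3) = {h:.3f}, p-value = {p:.4f}')
if p < 0.05:
    print('At least one model differs significantly; post-hoc tests are needed.')
```

A significant H statistic only says that *some* model differs; identifying which one requires the pairwise post-hoc tests reported in the even-numbered tables.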

Code
import os

from itables import show
from libs.chapter3.analysis.data_loading import load_reports, load_best_significant
from libs.chapter3.analysis.model import Model, Source, ActivityMetric, ModelMetric, TargetFilter, obtain_best_items
from libs.chapter3.analysis.statistical_tests import statistical_comparison
from libs.chapter3.analysis.visualization import plot_visual_comparison, plot_visual_ties

MODELS = [Model.MLP, Model.CNN, Model.LSTM, Model.CNN_LSTM]
SOURCES = [Source.SP, Source.SW, Source.FUSED]
SIGNIFICANCE_FILE = os.path.join('data', 'chapter3', 'significance', 'best_models.csv')

reports = load_reports()

Overall performance

Tables 8.1 and 8.2 show that the CNN is the best-performing model with any dataset and any amount of data.

With the smartphone dataset, the CNN-LSTM also performs well for low amounts of data, while with higher quantities there are no statistically significant differences among the MLP, LSTM and CNN-LSTM. Regarding the smartwatch dataset, the LSTM and CNN-LSTM perform best with medium and high amounts of data, along with the CNN model.

For the fused dataset, the CNN and the CNN-LSTM show the best accuracies with any amount of data, while the MLP yields significantly worse results than the other models.
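The tables report a Kruskal-Wallis H statistic and, in the post-hoc tables, a Mann-Whitney U statistic. As a minimal sketch of such a post-hoc step (synthetic data; the Holm correction here is only one possible way to control for multiple comparisons and is an assumption, not necessarily the chapter's method), the pairwise tests could be run as:

```python
# Illustrative post-hoc sketch: pairwise Mann-Whitney U tests between
# hypothetical models, with a hand-rolled Holm step-down correction.
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
samples = {
    'mlp': rng.normal(0.82, 0.02, 30),
    'cnn': rng.normal(0.86, 0.02, 30),
    'lstm': rng.normal(0.83, 0.02, 30),
    'cnn-lstm': rng.normal(0.85, 0.02, 30),
}

pairs = list(combinations(samples, 2))
raw = [mannwhitneyu(samples[a], samples[b]).pvalue for a, b in pairs]

# Holm correction: visit raw p-values in ascending order, multiply each by
# the number of remaining hypotheses, and enforce monotonicity.
order = np.argsort(raw)
adjusted = [0.0] * len(raw)
running_max = 0.0
for rank, idx in enumerate(order):
    p_adj = min(1.0, raw[idx] * (len(raw) - rank))
    running_max = max(running_max, p_adj)
    adjusted[idx] = running_max

for (a, b), p_adj in zip(pairs, adjusted):
    print(f'{a} vs {b}: adjusted p = {p_adj:.4f}')
```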

Code
models_overlall_tests, models_overall_posthoc = statistical_comparison(
    reports,
    (TargetFilter.MODEL, ModelMetric.ACCURACY), 
    SOURCES, 
    MODELS
)
models_overlall_tests
Table 8.1: Statistical comparison of overall accuracies obtained by the models for each data source.
sp sw fused
mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value
n
1 0.202 0.390 0.283 0.426 63.138 0.0 0.528 0.678 0.650 0.624 132.544 0.0 0.242 0.511 0.437 0.475 66.589 0.0
2 0.491 0.659 0.583 0.682 49.458 0.0 0.679 0.755 0.728 0.739 71.994 0.0 0.491 0.716 0.661 0.654 69.982 0.0
3 0.609 0.719 0.665 0.710 54.824 0.0 0.710 0.788 0.776 0.774 89.297 0.0 0.624 0.787 0.739 0.767 108.814 0.0
4 0.700 0.776 0.694 0.759 44.015 0.0 0.736 0.807 0.791 0.801 108.119 0.0 0.729 0.821 0.792 0.799 65.616 0.0
5 0.743 0.811 0.753 0.800 45.246 0.0 0.766 0.829 0.812 0.816 90.656 0.0 0.759 0.838 0.796 0.830 81.758 0.0
6 0.739 0.798 0.758 0.793 30.480 0.0 0.777 0.834 0.829 0.823 94.456 0.0 0.788 0.849 0.812 0.839 70.833 0.0
7 0.782 0.837 0.771 0.819 44.531 0.0 0.778 0.838 0.831 0.823 113.771 0.0 0.795 0.852 0.821 0.847 75.129 0.0
8 0.789 0.836 0.786 0.821 38.073 0.0 0.795 0.840 0.842 0.833 107.121 0.0 0.802 0.858 0.832 0.852 70.929 0.0
9 0.813 0.843 0.813 0.823 24.215 0.0 0.803 0.857 0.846 0.852 128.881 0.0 0.819 0.872 0.849 0.869 78.168 0.0
10 0.819 0.854 0.826 0.838 30.906 0.0 0.807 0.858 0.851 0.849 122.873 0.0 0.840 0.875 0.860 0.874 60.891 0.0
11 0.825 0.856 0.826 0.838 22.820 0.0 0.803 0.854 0.850 0.849 123.208 0.0 0.840 0.874 0.861 0.870 49.824 0.0
12 0.831 0.855 0.830 0.835 17.955 0.0 0.803 0.858 0.852 0.854 142.441 0.0 0.847 0.877 0.866 0.883 61.943 0.0
13 0.837 0.861 0.846 0.851 18.860 0.0 0.810 0.859 0.856 0.856 149.958 0.0 0.849 0.887 0.867 0.888 87.474 0.0
14 0.841 0.869 0.848 0.854 19.856 0.0 0.815 0.863 0.861 0.856 159.488 0.0 0.855 0.889 0.875 0.886 64.162 0.0
15 0.842 0.869 0.850 0.852 24.013 0.0 0.815 0.864 0.863 0.858 158.026 0.0 0.861 0.889 0.875 0.890 59.296 0.0
16 0.841 0.870 0.850 0.853 31.019 0.0 0.814 0.868 0.866 0.860 168.872 0.0 0.864 0.890 0.882 0.894 75.081 0.0
17 0.851 0.877 0.855 0.856 24.430 0.0 0.816 0.867 0.867 0.861 170.798 0.0 0.867 0.895 0.878 0.894 72.092 0.0
18 0.846 0.877 0.854 0.858 27.748 0.0 0.815 0.867 0.869 0.864 170.957 0.0 0.867 0.896 0.879 0.894 72.640 0.0
19 0.856 0.873 0.855 0.859 23.246 0.0 0.821 0.869 0.870 0.868 175.237 0.0 0.872 0.896 0.884 0.896 69.911 0.0
20 0.852 0.880 0.864 0.859 27.445 0.0 0.817 0.872 0.872 0.866 178.248 0.0 0.876 0.898 0.886 0.897 66.134 0.0
21 0.854 0.884 0.864 0.863 38.568 0.0 0.824 0.871 0.871 0.867 170.556 0.0 0.873 0.897 0.885 0.900 58.449 0.0
22 0.861 0.886 0.865 0.868 34.760 0.0 0.822 0.874 0.875 0.869 198.656 0.0 0.879 0.902 0.889 0.901 65.920 0.0
Code
show(models_overall_posthoc, classes="display nowrap compact")
Table 8.2: Post-hoc tests for Table 8.1 to determine the best-performing models.
A B mean(A) std(A) mean(B) std(B) U p-value
focus n

Activity-wise performance

The following sections address the performance of the selected models in each activity and data source.
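The activity-wise comparisons rely on per-activity F1-scores. As a minimal sketch with hypothetical labels (not the chapter's data), per-class F1 values can be computed as follows:

```python
# Illustrative sketch: per-activity F1-scores from ground-truth and predicted
# activity labels, using sklearn's multi-class F1 with average=None.
from sklearn.metrics import f1_score

activities = ['SEATED', 'STANDING_UP', 'WALKING', 'TURNING', 'SITTING_DOWN']

# Hypothetical predictions: one misclassification (SEATED predicted as WALKING).
y_true = ['SEATED', 'SEATED', 'WALKING', 'TURNING', 'WALKING', 'SITTING_DOWN']
y_pred = ['SEATED', 'WALKING', 'WALKING', 'TURNING', 'WALKING', 'SITTING_DOWN']

# average=None returns one F1 value per label, in the order given by `labels`;
# zero_division=0 handles activities absent from this tiny sample.
per_activity = f1_score(y_true, y_pred, labels=activities, average=None,
                        zero_division=0)
for activity, f1 in zip(activities, per_activity):
    print(f'{activity}: F1 = {f1:.3f}')
```

For each activity, these per-class scores are the values that the non-parametric tests below compare across models.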

SEATED

Tables 8.3 and 8.4 show different results depending on the data source.

With the smartphone dataset, the CNN-LSTM performs well with both low and high quantities of data, while the CNN and LSTM also rank among the best with high amounts of data. With the smartwatch dataset, the best results with low amounts of data are obtained by the CNN, LSTM and CNN-LSTM, while with higher amounts no significant differences are observed among models. Regarding the fused dataset, the CNN and the CNN-LSTM are the best-performing models, followed by the LSTM; the MLP provides the worst results.

Code
models_seated_tests, models_seated_posthoc = statistical_comparison(
    reports,
    (TargetFilter.SEATED, ActivityMetric.F1), 
    SOURCES, 
    MODELS
)
models_seated_tests
Table 8.3: Statistical comparison of SEATED performance obtained by the models for each data source.
sp sw fused
mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value
n
1 0.065 0.000 0.083 0.084 11.680 0.009 0.397 0.772 0.749 0.644 149.050 0.000 0.116 0.388 0.316 0.136 15.160 0.002
2 0.210 0.273 0.176 0.737 65.804 0.000 0.706 0.830 0.823 0.826 85.527 0.000 0.456 0.716 0.650 0.705 31.707 0.000
3 0.245 0.593 0.431 0.788 160.718 0.000 0.805 0.849 0.844 0.850 44.039 0.000 0.543 0.800 0.741 0.816 117.621 0.000
4 0.498 0.733 0.587 0.800 79.843 0.000 0.840 0.864 0.851 0.858 19.856 0.000 0.673 0.823 0.800 0.827 79.399 0.000
5 0.660 0.783 0.653 0.812 82.175 0.000 0.841 0.868 0.856 0.858 17.609 0.001 0.685 0.839 0.800 0.853 97.686 0.000
6 0.645 0.774 0.669 0.797 63.842 0.000 0.850 0.869 0.857 0.875 24.199 0.000 0.720 0.839 0.800 0.849 75.313 0.000
7 0.727 0.794 0.735 0.821 70.529 0.000 0.849 0.866 0.862 0.872 12.677 0.005 0.702 0.854 0.828 0.868 110.599 0.000
8 0.738 0.810 0.758 0.811 57.865 0.000 0.868 0.873 0.867 0.877 7.535 0.057 0.717 0.861 0.833 0.870 126.446 0.000
9 0.778 0.833 0.797 0.836 38.683 0.000 0.871 0.880 0.870 0.879 8.419 0.038 0.793 0.861 0.845 0.870 69.917 0.000
10 0.767 0.830 0.794 0.833 47.298 0.000 0.872 0.877 0.871 0.881 9.849 0.020 0.787 0.872 0.844 0.875 89.605 0.000
11 0.780 0.827 0.799 0.830 38.571 0.000 0.870 0.878 0.866 0.878 4.333 0.228 0.798 0.867 0.844 0.865 68.845 0.000
12 0.787 0.833 0.811 0.833 28.954 0.000 0.876 0.880 0.876 0.882 4.283 0.232 0.808 0.876 0.862 0.875 71.107 0.000
13 0.807 0.831 0.806 0.831 16.346 0.001 0.877 0.884 0.877 0.878 5.176 0.159 0.816 0.871 0.847 0.874 69.197 0.000
14 0.802 0.842 0.828 0.828 18.119 0.000 0.877 0.885 0.872 0.879 3.501 0.321 0.813 0.877 0.863 0.874 63.730 0.000
15 0.801 0.837 0.830 0.833 18.467 0.000 0.873 0.882 0.876 0.882 3.256 0.354 0.835 0.872 0.858 0.880 54.007 0.000
16 0.809 0.847 0.823 0.843 27.844 0.000 0.875 0.885 0.887 0.886 6.688 0.083 0.841 0.876 0.857 0.880 52.211 0.000
17 0.807 0.848 0.830 0.836 15.002 0.002 0.881 0.880 0.876 0.881 1.035 0.793 0.833 0.877 0.865 0.880 55.932 0.000
18 0.807 0.842 0.837 0.843 16.857 0.001 0.882 0.885 0.886 0.882 2.252 0.522 0.838 0.875 0.861 0.880 50.721 0.000
19 0.812 0.842 0.829 0.837 11.724 0.008 0.880 0.889 0.878 0.881 3.603 0.308 0.846 0.875 0.865 0.885 53.275 0.000
20 0.813 0.851 0.830 0.844 20.166 0.000 0.885 0.884 0.883 0.884 1.086 0.780 0.842 0.877 0.868 0.883 51.086 0.000
21 0.814 0.854 0.842 0.837 17.413 0.001 0.875 0.881 0.886 0.883 4.813 0.186 0.844 0.875 0.867 0.885 56.551 0.000
22 0.826 0.850 0.851 0.842 13.438 0.004 0.879 0.887 0.887 0.883 5.229 0.156 0.845 0.875 0.867 0.885 48.592 0.000
Code
show(models_seated_posthoc, classes="display nowrap compact")
Table 8.4: Post-hoc tests for Table 8.3 to determine the best-performing models.
A B mean(A) std(A) mean(B) std(B) U p-value
focus n

STANDING_UP

The results in Tables 8.5 and 8.6 show that the CNN model is the best-performing one with any amount of data and any data source. The CNN-LSTM also ranks best with the smartphone and fused datasets for any quantity of data, while with the smartwatch dataset it struggles with low quantities of data. The LSTM also performs well with high amounts of data using the smartphone and smartwatch datasets, and it provides better results than the MLP with the fused dataset.

Code
models_standing_tests, models_standing_posthoc = statistical_comparison(
    reports,
    (TargetFilter.STANDING_UP, ActivityMetric.F1), 
    SOURCES, 
    MODELS
)
models_standing_tests
Table 8.5: Statistical comparison of STANDING_UP performance obtained by the models for each data source.
sp sw fused
mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value
n
1 0.060 0.130 0.093 0.139 7.222 0.065 0.434 0.583 0.546 0.517 75.799 0.0 0.141 0.320 0.290 0.275 42.826 0.0
2 0.209 0.362 0.272 0.318 8.956 0.030 0.555 0.675 0.632 0.656 62.556 0.0 0.320 0.563 0.481 0.487 37.790 0.0
3 0.324 0.482 0.369 0.494 23.561 0.000 0.563 0.716 0.667 0.681 118.717 0.0 0.469 0.667 0.550 0.646 56.076 0.0
4 0.485 0.609 0.470 0.600 20.113 0.000 0.613 0.767 0.717 0.730 128.312 0.0 0.594 0.738 0.670 0.722 43.020 0.0
5 0.496 0.657 0.541 0.638 29.682 0.000 0.634 0.781 0.753 0.760 145.508 0.0 0.614 0.749 0.699 0.742 54.173 0.0
6 0.514 0.643 0.528 0.626 13.981 0.003 0.637 0.800 0.771 0.772 164.682 0.0 0.626 0.766 0.701 0.760 44.998 0.0
7 0.590 0.693 0.564 0.699 28.916 0.000 0.658 0.809 0.768 0.786 169.250 0.0 0.667 0.777 0.713 0.777 65.890 0.0
8 0.607 0.713 0.642 0.699 21.332 0.000 0.674 0.804 0.796 0.788 158.415 0.0 0.678 0.793 0.732 0.793 56.438 0.0
9 0.651 0.745 0.704 0.726 18.538 0.000 0.694 0.823 0.811 0.805 198.948 0.0 0.731 0.813 0.777 0.809 51.521 0.0
10 0.643 0.769 0.707 0.743 30.327 0.000 0.698 0.830 0.824 0.818 182.902 0.0 0.732 0.818 0.789 0.822 49.148 0.0
11 0.681 0.775 0.724 0.771 22.470 0.000 0.696 0.823 0.814 0.814 185.085 0.0 0.735 0.827 0.791 0.824 51.660 0.0
12 0.675 0.748 0.721 0.756 14.996 0.002 0.688 0.830 0.825 0.827 208.469 0.0 0.740 0.825 0.786 0.840 50.615 0.0
13 0.696 0.775 0.762 0.771 18.254 0.000 0.693 0.833 0.832 0.819 202.887 0.0 0.766 0.841 0.800 0.847 63.189 0.0
14 0.712 0.788 0.765 0.777 19.215 0.000 0.697 0.832 0.835 0.826 189.750 0.0 0.755 0.838 0.813 0.839 50.068 0.0
15 0.721 0.792 0.765 0.769 21.447 0.000 0.703 0.845 0.842 0.839 220.215 0.0 0.769 0.848 0.821 0.851 53.440 0.0
16 0.726 0.800 0.765 0.791 22.335 0.000 0.715 0.843 0.841 0.836 203.906 0.0 0.779 0.857 0.830 0.861 68.146 0.0
17 0.738 0.791 0.780 0.783 12.886 0.005 0.712 0.842 0.843 0.832 222.401 0.0 0.778 0.857 0.825 0.857 60.781 0.0
18 0.724 0.805 0.779 0.786 20.594 0.000 0.716 0.843 0.854 0.833 198.738 0.0 0.789 0.859 0.825 0.860 53.781 0.0
19 0.753 0.809 0.788 0.798 17.009 0.001 0.712 0.847 0.846 0.831 199.322 0.0 0.794 0.860 0.844 0.870 61.465 0.0
20 0.749 0.805 0.789 0.800 18.931 0.000 0.714 0.850 0.852 0.838 211.687 0.0 0.798 0.869 0.845 0.862 48.278 0.0
21 0.753 0.809 0.800 0.802 21.894 0.000 0.717 0.847 0.846 0.837 206.533 0.0 0.800 0.871 0.847 0.865 44.830 0.0
22 0.745 0.821 0.792 0.807 26.722 0.000 0.714 0.857 0.851 0.844 220.480 0.0 0.800 0.871 0.859 0.872 51.819 0.0
Code
show(models_standing_posthoc, classes="display nowrap compact")
Table 8.6: Post-hoc tests for Table 8.5 to determine the best-performing models.
A B mean(A) std(A) mean(B) std(B) U p-value
focus n

WALKING

The results shown in Tables 8.7 and 8.8 indicate that the CNN model provides the best results with any amount of data across the three data sources. The CNN-LSTM obtains good results with low amounts of data with the smartphone dataset, and also produces the best results with medium and high quantities of data with the fused dataset. The LSTM performs well using the smartwatch dataset, similarly to the CNN. The MLP provides the worst results with the smartwatch dataset, although its results do not differ from those of the LSTM using the smartphone and fused datasets.

Code
models_walking_tests, models_walking_posthoc = statistical_comparison(
    reports,
    (TargetFilter.WALKING, ActivityMetric.F1), 
    SOURCES, 
    MODELS
)
models_walking_tests
Table 8.7: Statistical comparison of WALKING performance obtained by the models for each data source.
sp sw fused
mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value
n
1 0.027 0.471 0.347 0.509 32.545 0.000 0.679 0.765 0.734 0.736 41.215 0.0 0.071 0.620 0.576 0.603 74.004 0.000
2 0.599 0.775 0.728 0.794 36.256 0.000 0.778 0.833 0.801 0.810 22.720 0.0 0.667 0.811 0.759 0.768 43.791 0.000
3 0.764 0.826 0.784 0.820 37.277 0.000 0.795 0.852 0.841 0.843 28.667 0.0 0.780 0.861 0.803 0.836 60.772 0.000
4 0.817 0.859 0.799 0.847 31.366 0.000 0.815 0.865 0.860 0.861 53.271 0.0 0.813 0.875 0.847 0.853 44.495 0.000
5 0.856 0.889 0.838 0.874 39.282 0.000 0.835 0.883 0.870 0.867 46.765 0.0 0.847 0.888 0.853 0.883 49.080 0.000
6 0.845 0.874 0.833 0.874 25.530 0.000 0.848 0.886 0.877 0.870 39.992 0.0 0.856 0.893 0.860 0.885 40.431 0.000
7 0.874 0.899 0.842 0.898 47.897 0.000 0.849 0.887 0.883 0.871 54.890 0.0 0.876 0.901 0.880 0.894 29.142 0.000
8 0.870 0.895 0.848 0.884 35.927 0.000 0.855 0.886 0.889 0.879 47.876 0.0 0.882 0.904 0.877 0.898 30.463 0.000
9 0.893 0.905 0.882 0.887 21.562 0.000 0.859 0.900 0.893 0.893 73.398 0.0 0.887 0.915 0.890 0.914 44.828 0.000
10 0.898 0.910 0.884 0.904 23.344 0.000 0.865 0.901 0.894 0.892 61.901 0.0 0.899 0.915 0.904 0.913 22.228 0.000
11 0.898 0.907 0.879 0.899 15.977 0.001 0.868 0.897 0.894 0.892 60.148 0.0 0.899 0.912 0.902 0.912 14.110 0.003
12 0.905 0.911 0.887 0.896 18.304 0.000 0.866 0.901 0.899 0.893 78.130 0.0 0.904 0.914 0.904 0.919 26.160 0.000
13 0.905 0.917 0.900 0.906 17.876 0.000 0.869 0.901 0.898 0.896 80.415 0.0 0.908 0.924 0.907 0.925 48.442 0.000
14 0.909 0.916 0.903 0.910 14.686 0.002 0.870 0.907 0.906 0.898 102.948 0.0 0.908 0.925 0.917 0.925 31.878 0.000
15 0.909 0.922 0.902 0.908 18.750 0.000 0.874 0.908 0.905 0.900 85.587 0.0 0.915 0.925 0.914 0.927 33.798 0.000
16 0.910 0.925 0.904 0.909 29.269 0.000 0.869 0.909 0.905 0.899 100.920 0.0 0.914 0.925 0.919 0.930 39.304 0.000
17 0.914 0.925 0.903 0.910 24.184 0.000 0.871 0.908 0.909 0.902 106.378 0.0 0.918 0.928 0.916 0.928 33.331 0.000
18 0.913 0.928 0.909 0.914 23.263 0.000 0.873 0.908 0.906 0.902 100.240 0.0 0.916 0.928 0.918 0.933 36.701 0.000
19 0.916 0.927 0.905 0.918 31.125 0.000 0.875 0.908 0.906 0.904 106.258 0.0 0.918 0.929 0.919 0.932 34.601 0.000
20 0.916 0.928 0.912 0.917 24.246 0.000 0.874 0.912 0.910 0.906 109.615 0.0 0.922 0.930 0.925 0.932 30.369 0.000
21 0.917 0.931 0.915 0.918 35.637 0.000 0.880 0.913 0.913 0.905 97.691 0.0 0.923 0.931 0.925 0.933 26.441 0.000
22 0.918 0.932 0.912 0.921 32.243 0.000 0.877 0.913 0.911 0.907 126.709 0.0 0.923 0.933 0.925 0.935 34.526 0.000
Code
show(models_walking_posthoc, classes="display nowrap compact")
Table 8.8: Post-hoc tests for Table 8.7 to determine the best-performing models.
A B mean(A) std(A) mean(B) std(B) U p-value
focus n

TURNING

Tables 8.9 and 8.10 show that the CNN, LSTM and CNN-LSTM obtain the best results with the smartwatch dataset. These three models also perform well with low amounts of data using the smartphone and fused datasets; however, no significant differences among models are observed for \(n \geq 5\) and \(n \geq 7\), respectively. In the case of the smartphone dataset, significant differences reappear for \(n \geq 15\), with the MLP and the CNN being the best models. With the fused dataset, for \(n \geq 21\), the LSTM provides significantly worse results than the other models.

Code
models_turning_tests, models_turning_posthoc = statistical_comparison(
    reports,
    (TargetFilter.TURNING, ActivityMetric.F1), 
    SOURCES, 
    MODELS
)
models_turning_tests
Table 8.9: Statistical comparison of TURNING performance obtained by the models for each data source.
sp sw fused
mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value
n
1 0.151 0.589 0.402 0.568 100.751 0.000 0.412 0.648 0.514 0.574 232.008 0.0 0.344 0.534 0.478 0.497 82.072 0.000
2 0.636 0.779 0.744 0.763 27.331 0.000 0.614 0.692 0.642 0.663 67.321 0.0 0.451 0.761 0.768 0.682 103.413 0.000
3 0.733 0.796 0.784 0.773 19.168 0.000 0.674 0.728 0.715 0.709 57.915 0.0 0.648 0.796 0.783 0.759 94.346 0.000
4 0.783 0.820 0.813 0.818 11.748 0.008 0.670 0.742 0.728 0.727 67.707 0.0 0.738 0.810 0.818 0.785 35.390 0.000
5 0.829 0.826 0.811 0.839 7.205 0.066 0.701 0.753 0.746 0.741 47.987 0.0 0.784 0.827 0.835 0.809 35.670 0.000
6 0.821 0.827 0.812 0.827 5.653 0.130 0.714 0.766 0.761 0.757 64.089 0.0 0.797 0.823 0.827 0.822 18.993 0.000
7 0.840 0.842 0.825 0.844 5.340 0.149 0.716 0.773 0.763 0.758 64.602 0.0 0.828 0.837 0.835 0.830 8.279 0.041
8 0.836 0.849 0.832 0.846 5.351 0.148 0.723 0.772 0.780 0.759 67.717 0.0 0.834 0.839 0.841 0.832 3.417 0.332
9 0.847 0.843 0.838 0.839 3.940 0.268 0.728 0.786 0.773 0.781 72.782 0.0 0.836 0.849 0.845 0.840 6.762 0.080
10 0.854 0.854 0.851 0.844 1.941 0.585 0.734 0.793 0.786 0.786 76.116 0.0 0.846 0.849 0.851 0.847 0.834 0.841
11 0.845 0.852 0.845 0.846 3.307 0.347 0.728 0.784 0.784 0.782 78.787 0.0 0.850 0.849 0.850 0.839 7.096 0.069
12 0.853 0.859 0.848 0.848 7.396 0.060 0.726 0.794 0.786 0.790 95.204 0.0 0.859 0.849 0.855 0.854 2.081 0.556
13 0.856 0.864 0.852 0.860 7.294 0.063 0.736 0.799 0.788 0.785 95.504 0.0 0.859 0.860 0.851 0.859 1.590 0.662
14 0.856 0.860 0.850 0.856 3.968 0.265 0.743 0.797 0.802 0.791 88.731 0.0 0.862 0.862 0.855 0.855 6.602 0.086
15 0.861 0.867 0.854 0.857 10.448 0.015 0.745 0.802 0.799 0.795 95.701 0.0 0.865 0.860 0.860 0.856 4.725 0.193
16 0.862 0.867 0.860 0.856 8.346 0.039 0.738 0.802 0.807 0.794 101.760 0.0 0.863 0.862 0.860 0.857 0.397 0.941
17 0.866 0.871 0.855 0.860 14.598 0.002 0.735 0.801 0.806 0.806 116.932 0.0 0.868 0.862 0.859 0.860 5.646 0.130
18 0.864 0.866 0.858 0.858 9.824 0.020 0.743 0.802 0.805 0.800 107.298 0.0 0.862 0.867 0.860 0.862 3.121 0.373
19 0.866 0.869 0.853 0.860 14.999 0.002 0.742 0.807 0.812 0.805 106.796 0.0 0.864 0.869 0.859 0.863 3.746 0.290
20 0.864 0.870 0.859 0.864 13.366 0.004 0.742 0.812 0.815 0.799 126.077 0.0 0.868 0.868 0.860 0.866 5.491 0.139
21 0.865 0.874 0.860 0.862 15.351 0.002 0.750 0.819 0.812 0.803 125.665 0.0 0.870 0.869 0.859 0.864 12.210 0.007
22 0.864 0.870 0.861 0.865 8.253 0.041 0.743 0.812 0.817 0.807 134.359 0.0 0.868 0.870 0.861 0.866 10.585 0.014
Code
show(models_turning_posthoc, classes="display nowrap compact")
Table 8.10: Post-hoc tests for Table 8.9 to determine the best-performing models.
A B mean(A) std(A) mean(B) std(B) U p-value
focus n

SITTING_DOWN

The results from Tables 8.11 and 8.12 indicate that the CNN is the best-performing model with any quantity of data across data sources. The CNN-LSTM also performs well with any amount of data using the fused dataset, and with low and medium amounts of data using the smartphone and smartwatch datasets. The LSTM performs well with high amounts of data using the smartphone and smartwatch datasets, and provides better results than the MLP using the fused dataset. The MLP provides the worst results in every scenario.

Code
models_sitting_tests, models_sitting_posthoc = statistical_comparison(
    reports,
    (TargetFilter.SITTING_DOWN, ActivityMetric.F1), 
    SOURCES, 
    MODELS
)
models_sitting_tests
Table 8.11: Statistical comparison of SITTING_DOWN performance obtained by the models for each data source.
sp sw fused
mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value mlp cnn lstm cnn-lstm H(3) p-value
n
1 0.108 0.102 0.094 0.152 3.785 0.286 0.368 0.583 0.499 0.479 139.149 0.0 0.100 0.238 0.252 0.168 39.136 0.0
2 0.207 0.363 0.305 0.265 11.881 0.008 0.458 0.657 0.649 0.624 190.870 0.0 0.346 0.581 0.511 0.478 33.878 0.0
3 0.286 0.473 0.406 0.471 35.016 0.000 0.535 0.725 0.691 0.717 170.778 0.0 0.443 0.671 0.610 0.620 54.230 0.0
4 0.420 0.619 0.502 0.572 35.908 0.000 0.603 0.754 0.718 0.730 147.828 0.0 0.569 0.712 0.680 0.709 46.529 0.0
5 0.507 0.651 0.547 0.613 32.061 0.000 0.632 0.768 0.750 0.748 140.511 0.0 0.612 0.753 0.654 0.754 91.191 0.0
6 0.518 0.650 0.617 0.632 27.322 0.000 0.633 0.786 0.766 0.766 176.413 0.0 0.642 0.779 0.711 0.780 79.841 0.0
7 0.594 0.704 0.597 0.680 38.170 0.000 0.659 0.793 0.778 0.778 166.434 0.0 0.658 0.800 0.738 0.812 109.438 0.0
8 0.612 0.705 0.634 0.693 33.109 0.000 0.667 0.806 0.794 0.786 168.501 0.0 0.679 0.811 0.754 0.807 94.447 0.0
9 0.659 0.738 0.689 0.692 30.200 0.000 0.688 0.822 0.808 0.806 175.911 0.0 0.698 0.827 0.785 0.815 92.280 0.0
10 0.660 0.751 0.736 0.724 35.525 0.000 0.695 0.829 0.815 0.802 169.903 0.0 0.731 0.841 0.792 0.828 88.379 0.0
11 0.669 0.766 0.739 0.727 29.183 0.000 0.695 0.828 0.809 0.811 178.988 0.0 0.728 0.828 0.786 0.828 95.685 0.0
12 0.687 0.765 0.743 0.736 22.593 0.000 0.697 0.824 0.818 0.805 179.573 0.0 0.743 0.843 0.814 0.842 101.743 0.0
13 0.710 0.774 0.753 0.756 19.137 0.000 0.701 0.827 0.817 0.804 185.931 0.0 0.765 0.850 0.809 0.856 105.213 0.0
14 0.717 0.784 0.765 0.753 21.136 0.000 0.706 0.832 0.823 0.811 202.297 0.0 0.762 0.860 0.831 0.849 106.044 0.0
15 0.713 0.787 0.778 0.761 26.514 0.000 0.714 0.834 0.828 0.814 186.076 0.0 0.783 0.857 0.824 0.857 75.894 0.0
16 0.714 0.791 0.783 0.759 37.802 0.000 0.704 0.845 0.831 0.823 218.240 0.0 0.785 0.848 0.838 0.865 95.678 0.0
17 0.734 0.809 0.796 0.774 28.286 0.000 0.720 0.837 0.833 0.822 182.075 0.0 0.786 0.857 0.832 0.860 86.950 0.0
18 0.739 0.800 0.795 0.768 30.999 0.000 0.715 0.846 0.833 0.824 191.156 0.0 0.785 0.869 0.837 0.866 120.677 0.0
19 0.743 0.806 0.786 0.769 27.103 0.000 0.722 0.845 0.844 0.823 204.957 0.0 0.800 0.873 0.839 0.870 86.678 0.0
20 0.750 0.807 0.796 0.783 32.416 0.000 0.714 0.838 0.833 0.822 201.035 0.0 0.802 0.873 0.850 0.876 92.564 0.0
21 0.744 0.816 0.806 0.788 37.110 0.000 0.730 0.840 0.841 0.822 180.729 0.0 0.813 0.872 0.840 0.871 91.379 0.0
22 0.768 0.821 0.800 0.790 29.114 0.000 0.726 0.850 0.842 0.829 211.321 0.0 0.810 0.879 0.846 0.876 94.619 0.0
Code
show(models_sitting_posthoc, classes="display nowrap compact")
Table 8.12: Post-hoc tests for Table 8.11 to determine the best-performing models.
A B mean(A) std(A) mean(B) std(B) U p-value
focus n

Summary

The executed analyses show that the CNN is always the best-performing model in terms of overall accuracy, for all data sources and any amount of data. The LSTM also performs well with the smartwatch dataset, and the CNN-LSTM with the smartwatch and fused datasets. The MLP is the worst-performing model across data sources, with significant differences for the smartwatch- and fused-trained models.

Regarding activity-wise performance, the CNN model presents the best results in every activity and data source. The LSTM performs similarly to the CNN using the smartwatch dataset in all activities, and using the smartphone dataset in all activities except TURNING. The CNN-LSTM seems to work well in some activities with the smartphone and smartwatch datasets, although its results are somewhat unstable. On the other hand, the CNN-LSTM shines with the fused dataset, obtaining results similar to those of the CNN. The MLP presents the worst results in every case except WALKING and TURNING with the smartphone dataset.

These results are graphically summarized in Figures 8.1 and 8.2, which represent the best-performing model in terms of overall accuracy and activity F1-scores. Some examples show how to interpret the figures: for the SITTING_DOWN activity with \(n=1\), no significant differences among model types are observed when using the smartphone dataset; with the smartwatch dataset, the CNN statistically obtains the best performance; and with the fused dataset, the LSTM obtains the best performance, although it is not statistically better than every other model (whether MLP, CNN or CNN-LSTM can be determined by checking Table 8.11).

The figure shows a clear dominance of the CNN model: it provides the best metrics even when there is no significant difference with the other models and, when it does not provide the best results, these are on most occasions not significantly different from those of the best models. Also noticeable is the lack of influence of the model type in the SEATED activity with the smartwatch dataset, and in the TURNING activity with the smartphone and fused datasets.

In summary, the CNN can be considered the best of the examined models, since it performs well in every situation. The LSTM and the CNN-LSTM are also feasible options when using smartwatch and fused data. The use of the MLP is strongly discouraged, since it obtains the worst results both here and in related works.

Code
sources_results = {
    TargetFilter.MODEL: models_overlall_tests, 
    TargetFilter.SEATED: models_seated_tests, 
    TargetFilter.STANDING_UP: models_standing_tests, 
    TargetFilter.WALKING: models_walking_tests, 
    TargetFilter.TURNING: models_turning_tests, 
    TargetFilter.SITTING_DOWN: models_sitting_tests
}

best_sources = obtain_best_items(sources_results, MODELS, SOURCES)
significance_sources = load_best_significant(SIGNIFICANCE_FILE)

plot_visual_comparison(best_sources, significance_sources, MODELS, SOURCES)
Figure 8.1: Graphical representation of the best models for each combination of metric, data source and amount of data. Symbols: ▲ (MLP), ■ (CNN), ◆ (LSTM) and ● (CNN-LSTM).
Code
plot_visual_ties(best_sources, significance_sources, MODELS, SOURCES)
Figure 8.2: Visual representation of performance ties. The plot indicates the number of times each model tied with every other model.

Code reference