<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="description" content="">
<meta name="author" content="">
<title>Adax - Consumption habits, for a healthier way of life</title>
<!-- Bootstrap Core CSS -->
<link href="vendor/bootstrap/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="vendor/fontawesome-free/css/all.min.css" rel="stylesheet" type="text/css">
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,700,300italic,400italic,700italic" rel="stylesheet" type="text/css">
<link href="vendor/simple-line-icons/css/simple-line-icons.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="css/stylish-portfolio.min.css" rel="stylesheet">
</head>
<body id="page-top">
<!-- Navigation -->
<a class="menu-toggle rounded" href="#">
<i class="fas fa-bars"></i>
</a>
<nav id="sidebar-wrapper">
<ul class="sidebar-nav">
<li class="sidebar-nav-item">
<a class="js-scroll-trigger" href="#page-top">Home</a>
</li>
<li class="sidebar-nav-item">
<a class="js-scroll-trigger" href="#approach">Study design</a>
</li>
<li class="sidebar-nav-item">
<a class="js-scroll-trigger" href="#finsights">First insights</a>
</li>
<li class="sidebar-nav-item">
<a class="js-scroll-trigger" href="#parag2">Users clustering</a>
</li>
<li class="sidebar-nav-item">
<a class="js-scroll-trigger" href="#parag3">Popularity score</a>
</li>
<li class="sidebar-nav-item">
<a class="js-scroll-trigger" href="#parag4">Recommending better products</a>
</li>
<li class="sidebar-nav-item">
<a class="js-scroll-trigger" href="#conclusion">Conclusion</a>
</li>
</ul>
</nav>
<!-- Header -->
<header class="masthead d-flex">
<div class="container text-center my-auto">
<h1 class="mb-1">Consumption habits, for a healthier way of life</h1>
<h3 class="mb-5">
<em>A datastory on the Instacart Dataset, designed for the Applied Data Analysis Course @EPFL </em>
</h3>
<a class="btn btn-dark btn-xl js-scroll-trigger" href="#approach">Explore!</a>
</div>
<div class="overlay"></div>
</header>
<div class="container text-center my-auto">
<p class="lead mg-5"> <i> Photograph of an addax (Addax nasomaculatus) at the Louisville Zoo, eating healthy grass.</i> <a href="https://fr.wikipedia.org/wiki/Fichier:Addax_at_the_Louisville_Zoo.jpg">Source here</a> </p>
</div>
<!-- Approach -->
<section class="content-section bg-light" id="approach">
<div class="container text-justify">
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">Data for social good: improving users' consumption habits</h2>
<p class="lead mb-5">The study of consumption habits is a popular topic, as it is very useful for marketing and advertising.
However, our goal here is to use data <b>to help consumers directly.</b>
Through an analysis of the Instacart dataset, we aim to give readers some advice on how to reorient their consumption habits toward a healthier
way of life. All in all, we provide the first steps of a smooth transition to more responsible consumption.</p>
</div>
</div>
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">The Instacart dataset</h2>
<p class="lead mb-5">Instacart is a well-known American company that provides an online shopping service. It made its real data for the year 2017 publicly available.
This dataset consists of 4 different tables with data regarding the orders, the products and some categorization of the products. It also contains one table retracing the history of orders
per product and per client; see <a href="https://www.instacart.com/datasets/grocery-shopping-2017">here</a> for more details. </p>
</div>
</div>
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">Our research questions</h2>
<p class="lead mb-5"> Exploring such a big dataset is a challenge, and you definitely need a strategy to dive into it. We thus chose to answer the following questions as guidelines:</p>
<ul class="lead">
<li>
Can we classify consumers depending on the categories of products they buy (aisles and departments)
and their distribution (fresh products, cooked meals, cans, …)?
</li>
<li>
Can we define one of these groups as having the “healthier consumption”, toward which people should tend?
</li>
<li>
What are the consumption habits of the consumer groups, especially the healthiest one (day of the week, hour of the day,
time since the previous order, number of orders, quantity ordered, reordering rate)?
</li>
<li>
Can we find equivalent products between the different consumer groups (by aisle),
and thus offer healthier alternative products? Which ones should be advised, based on reorders?
</li>
</ul>
<p class="lead mb-5"> Ready? <b> Let's dive in! </b> </p>
</div>
</div>
</div>
<div class="container text-center">
<a class="btn btn-dark btn-xl js-scroll-trigger" href="#finsights">First insights</a>
</div>
</section>
<!-- Finsights -->
<section class="content-section bg-primary text-white" id="finsights">
<div class="container text-justify">
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">First Insights</h2>
<p class="lead mb-5">Before answering the questions defined above, let's have a look at some general insights into the dataset. These insights
provide a good overview of users' habits.</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">When do people commonly order?</h4>
<img src="./img/day_order.png" alt="Distribution of orders per day">
<img src="./img/hour_order.png" alt="Distribution of orders per hour">
<p class="lead mb-5"> As could be expected, people tend to order during working hours (between 8 a.m. and 8 p.m.), and during the weekend.</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">How many times do people order from Instacart (i.e., are they loyal)?</h4>
<img src="./img/order_numbers.png" alt="Number of orders per user" width="1000">
<p class="lead mb-5"> Most consumers only order 4 times, and none of them order fewer times: we concluded that we were provided
with a dataset containing only part of the user data, excluding the least loyal users.</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">How many products can one find in the Instacart dataset?</h4>
<img src="./img/products_per_dpt.png" alt="Number of products per department">
<p class="lead mb-5"> A large range of products can be found in the Instacart dataset.
The classification into departments (further divided into aisles) makes it possible to identify similar products.</p>
<p class="lead mb-5"> With these first observations made, let's now move on to the more interesting problem of clustering users.</p>
</div>
</div>
</div>
<div class="container text-center">
<a class="btn btn-dark btn-xl js-scroll-trigger" href="#parag2">Users clustering</a>
</div>
</section>
<!-- Paragraph2 -->
<section class="content-section bg-light" id="parag2">
<div class="container text-justify">
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">Users clustering and healthiness score</h2>
<p class="lead mb-5">We wanted to classify consumers depending on the categories of products they buy, in order to define a cluster of healthy consumers.
This is a tremendous challenge, as the Instacart dataset offers no metric to quantify the healthiness of a product.
The risk is that a direct clustering on all available features won't differentiate between healthy and unhealthy consumers, but rather along other
criteria (consumers eating more salty or more sweet food, for instance).</p>
<p class="lead mb-5">Thus, we first hand-selected 14 aisles that were directly linked to healthiness. Then, we applied a Principal Component Analysis (PCA)
to reduce the number of features to cluster on. The aisles contributing the most to the first 4 principal dimensions were all discriminating in terms of healthiness.
More precisely, the first three dimensions were respectively led by "fresh fruits", "fresh vegetables" and "packaged vegetables fruits", considered to be healthy.
The 4th dimension was led by "soft drinks", which is more related to unhealthiness. We thus chose to cluster on these 4 dimensions.</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">Clusters visualization</h4>
<img src="./img/clustering.png" alt="Clustering visualization" width="700">
<p class="lead mb-5"> The representation above shows the clusters projected onto the first two dimensions of the PCA.
The three clusters are well defined and do not overlap, but the separation seems somewhat artificial.
This is most likely due to the difficulty of really differentiating the users, because their habits form a continuum.</p>
<p class="lead mb-5"> Still, we can see tendencies emerging from the clustering, as reported below.</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">Healthiness-related products categorize clusters </h4>
<img src="./img/discriminatories_aisles.PNG" alt="Chi-2 score for aisles significance">
<p class="lead mb-5"> This table shows the product aisles having the highest Chi-2 significance after clustering.
As expected after the PCA, the consumption of fresh fruits and vegetables is really meaningful.
The consumption of soft drinks and frozen meals, in turn, characterizes unhealthy users.</p>
<p class="lead mb-5"> Looking more precisely at the individual features of our three clusters enabled us to rank them in terms of healthiness.
The healthiest one (referred to as cluster 2, purple above) is characterized by a high consumption of vegetables and herbs. This may indicate a group of
people taking the time to cook instead of eating ready-made meals.</p>
<p class="lead mb-5"> The second healthiest one (referred to as cluster 1, green above) is characterized by a high consumption of fruits.</p>
<p class="lead mb-5"> The least healthy one (referred to as cluster 0, yellow above) is characterized by a high consumption of soft drinks,
baking products, snacks and candies.</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">No ordering habit is specific to any cluster</h4>
<p class="lead mb-5"> We wanted to know whether the clusters showed different trends in terms of ordering habits: day of the week and hour of purchase,
time between two orders, etc. Our goal was to identify good practices that could be advised to the cluster of people eating less healthily. For instance,
less time between two orders (indicating less meal planning) could have been associated with unhealthy consumption.
However, there is no meaningful difference between clusters in terms of ordering habits, and no good practice could be inferred.
</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">Conclusion on the clusters and Healthiness score of a product</h4>
<p class="lead mb-5"> Even if we didn't manage to identify good ordering practices, we now have 3 well-defined clusters, ranked in terms of
healthiness. It is thus possible to rely on these clusters to characterize the healthiness of individual products and recommend better products
to consumers. We can define the healthiness of a product as a function of the distribution (in terms of clusters) of the people who consumed this product.
</p>
<p class="lead mb-5">For instance, if a "healthy" user consumes a certain product, this gives 4 points to the product.
An "average" consumer would give 1 point, while an "unhealthy" consumer would remove 1 point from the product.
Then, it is easy to combine these (arbitrary) points into a weighted average that takes the number of consumptions into account.
The higher the score, the healthier the product.
In particular, a product with a score of -1 is only consumed by people of the unhealthy cluster, and a product rated 4 is only consumed by people of the healthy cluster.</p>
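<p class="lead mb-5">As a minimal sketch of this scoring rule, assume that each purchase contributes the points of its buyer's cluster and that the product score is the plain average over purchases (one possible reading of the weighted average described above). The function name, the data layout and the toy purchases are hypothetical and are not taken from the Instacart dataset.</p>

```python
from collections import defaultdict

# Points given by one purchase, by buyer cluster (values from the text above).
CLUSTER_POINTS = {"healthy": 4, "average": 1, "unhealthy": -1}


def healthiness_scores(purchases, user_cluster):
    """Average the cluster points over all purchases of each product.

    purchases: iterable of (user_id, product_name), one entry per purchase.
    user_cluster: dict mapping user_id to a key of CLUSTER_POINTS.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for user, product in purchases:
        total[product] += CLUSTER_POINTS[user_cluster[user]]
        count[product] += 1
    return {product: total[product] / count[product] for product in total}


# Toy data, purely illustrative (hypothetical users and products).
user_cluster = {"u1": "healthy", "u2": "average", "u3": "unhealthy"}
purchases = [
    ("u1", "kale"), ("u1", "kale"), ("u2", "kale"),
    ("u3", "soda"), ("u3", "soda"),
    ("u1", "bread"), ("u3", "bread"),
]
scores = healthiness_scores(purchases, user_cluster)
print(scores)
```

<p class="lead mb-5">In this toy example, a product bought only by the "unhealthy" user gets exactly -1, while a product bought by both the "healthy" and the "unhealthy" user lands in between the two extremes.</p>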
<p class="lead mb-5">
Defined this way, however, the healthiness score may be irrelevant when a product is only consumed by very few people, because this induces a large variance in the score.
To deal with this limitation, we chose to also take a product's popularity into account by deriving a popularity score for each product.
</p>
</div>
</div>
</div>
<div class="container text-center">
<a class="btn btn-dark btn-xl js-scroll-trigger" href="#parag3">Popularity score</a>
</div>
</section>
<!-- Paragraph3 -->
<section class="content-section bg-primary text-white" id="parag3">
<div class="container text-justify">
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">Popularity score</h2>
<p class="lead mb-5">A first way to quantify popularity is to look at the number of distinct consumers ordering a product.</p>
<img src="./img/popular_products.png" alt="Number of distinct consumers per product" width=1000>
<p class="lead mb-5">Some products have been consumed by very few people, sometimes none,
so we have no (reliable) information about their scores. A quick exploration of these products shows that they are not products we would
advise, because they are too specific. We decided to give these products a score of 0.</p>
<p class="lead mb-5"> To gauge the popularity of a product, we chose to infer from our dataset the average number of times it is re-bought.
Nevertheless, this raw score would not be meaningful enough: we have to take each consumer's consumption habits into account, as shown by the following example:</p>
<p class="lead mb-5">Imagine a consumer, Ada, who always buys the same set of products, each of them 10 times, except one product,
let's say "chicory and ham", which wasn't so good and which she bought only 4 times.
Now imagine a second consumer, Robert, who doesn't like always consuming the same products and tries to vary his consumption as much as possible.
He bought each product once, except one, say "goat cheese", which he loved and bought 3 times. If Ada and Robert are the only consumers in the dataset,
"chicory and ham" would have a better popularity score (4) than "goat cheese" (3). We don't want this to happen!</p>
<p class="lead mb-5"> Thus, we first have to carefully study users' re-ordering habits!</p>
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">Users habits</h4>
<img src="./img/reordering_1.png" alt="Reordering of products <=5">
<img src="./img/reordering_2.png" alt="Reordering of products >=5">
<p class="lead mb-5"> Most users re-order a product less than once on average. We observe big peaks at 0, 1/2, 1 and 2,
which must correspond to users who rarely ordered. A further result is that the consumers who reorder the most on average are among the biggest consumers, as shown
by the graph below.</p>
<img src="./img/user_habits.png" alt="User habits in terms of reordering">
<p class="lead mb-5"> Most users only ordered a few times, and few users go beyond a certain number of orders. For users with many orders, the average re-ordering is never very low,
which can be explained by the fact that people who buy that much have strong consumption habits.</p>
<p class="lead mb-5"> As the final metric for the popularity score, we thus kept the re-ordering rate, but chose to normalize it
by the average fidelity of each user (the average number of times he or she consumes each product). The greater this score is, the more
likely the product is to be re-bought. The absolute value of the score is difficult to interpret, but we can compare different scores quite safely.</p>
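<p class="lead mb-5">The normalization can be sketched as follows, under stated assumptions: each purchase count is divided by the buyer's fidelity, and a product's score is the mean of these normalized counts over its buyers. The exact aggregation used in the project is not spelled out here, so this is only one plausible reading. The function name is hypothetical, and the "pasta" and "rice" products are invented to complete the Ada and Robert example.</p>

```python
from collections import defaultdict


def popularity_scores(counts):
    """counts: dict mapping (user, product) to a number of purchases."""
    per_user = defaultdict(list)
    for (user, _product), n in counts.items():
        per_user[user].append(n)
    # Fidelity of a user: average number of times they buy each of their products.
    fidelity = {user: sum(ns) / len(ns) for user, ns in per_user.items()}

    ratios = defaultdict(list)
    for (user, product), n in counts.items():
        ratios[product].append(n / fidelity[user])
    # Product score: mean of the normalized counts over the product's buyers.
    return {product: sum(rs) / len(rs) for product, rs in ratios.items()}


# Ada repeats the same basket; Robert varies his purchases (toy data).
counts = {
    ("Ada", "pasta"): 10, ("Ada", "rice"): 10, ("Ada", "chicory and ham"): 4,
    ("Robert", "pasta"): 1, ("Robert", "rice"): 1, ("Robert", "goat cheese"): 3,
}
scores = popularity_scores(counts)
print(scores["chicory and ham"], scores["goat cheese"])
```

<p class="lead mb-5">With this normalization, "goat cheese" ends up above 1 (bought more often than Robert's average product) and "chicory and ham" below 1 (bought less often than Ada's average product), which reverses the ranking given by the raw counts.</p>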
</div>
<div class="col-lg-10 mx-auto">
<h4 class="mb-4">Popularity score insights</h4>
<p class="lead mb-5"> Now that the popularity score is defined, let's look at its distribution.</p>
<img src="./img/fidelity_1.png" alt="Popularity score 1">
<img src="./img/fidelity_2.png" alt="Popularity score 2">
<img src="./img/fidelity_3.png" alt="Popularity score 3">
<p class="lead mb-5">We observe that most of the values (whose weighted average over consumptions should be one!) lie in the interval [0, 2],
with a large peak at 0. Let's compare these values with the reliability of each score, which corresponds to the number of times the product was bought.</p>
<img src="./img/fidelity_vs_consumption.png" alt="Fidelity vs consumption">
<p class="lead mb-5"> We can't see a very clear correlation. Nevertheless, we can note two things:</p>
<ul class="lead">
<li> The most consumed products overall are on average better liked than others (score greater than 1).</li>
<li> A great number of products were only bought once by fewer than a dozen users.</li>
</ul>
<p class="lead mb-5"> We now have, for each product, two scores: a healthiness score and a popularity score. Based on these two,
we now aim to recommend "better products" to users.</p>
</div>
</div>
</div>
<div class="container text-center">
<a class="btn btn-dark btn-xl js-scroll-trigger" href="#parag4">Recommending better products</a>
</div>
</section>
<!-- Paragraph4 -->
<section class="content-section bg-light" id="parag4">
<div class="container text-justify">
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">Recommending better products</h2>
<p class="lead mb-5">We want to recommend "healthier" products to users whom we consider
as having rather unhealthy consumption habits. To do this, we built for each aisle a graph of
products in which two products bought by the same users are connected by an edge of low weight:
the lower the weight, the more users buy both products (and the more they buy each product).
We then expect two products connected by a low-weight edge to be appreciated by the same users
and to share some characteristics. Thus, it should make sense to recommend, instead of a product deemed "unhealthy",
a "healthy" product of the same aisle that has a very short path from the "unhealthy" product.</p>
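<p class="lead mb-5">A minimal sketch of this lookup, assuming the per-aisle graph is already built and stored as a dictionary of weighted edges: shortest paths are computed with Dijkstra's algorithm, and the closest product whose healthiness reaches a chosen threshold is returned. The function names, the threshold parameter, the edge weights and the healthiness values below are all illustrative assumptions, not the project's actual code or data.</p>

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra over graph: dict product -> dict neighbour -> edge weight."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if neighbour not in dist or dist[neighbour] > candidate:
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


def recommend(graph, healthiness, product, threshold):
    """Closest product (by path length) whose healthiness reaches threshold."""
    dist = shortest_distances(graph, product)
    healthy = [p for p in dist if p != product and healthiness[p] >= threshold]
    return min(healthy, key=dist.get) if healthy else None


# Toy aisle: low weights link products that are often bought by the same users.
graph = {
    "candy bar": {"milk chocolate": 1.0, "dark chocolate": 5.0},
    "milk chocolate": {"candy bar": 1.0, "dark chocolate": 2.0},
    "dark chocolate": {"candy bar": 5.0, "milk chocolate": 2.0},
}
healthiness = {"candy bar": -0.2, "milk chocolate": 0.5, "dark chocolate": 2.3}
suggestion = recommend(graph, healthiness, "candy bar", threshold=2.0)
print(suggestion)
```

<p class="lead mb-5">Here the path through "milk chocolate" (total weight 3) is shorter than the direct edge (weight 5), so a product can be recommended even when few users buy it together with the original one, as long as an intermediate product links the two.</p>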
<p class="lead mb-5">When looking at recommendations for the cluster of "unhealthy" consumers,
it is worth noticing that many times, an organic product was recommended instead of a non-organic product.
Here are a few examples of such recommendations (Healthiness scale -1 to 4): </p>
<ul class="lead">
<li>Organic whole milk (Healthiness 2.6) instead of Whole milk (Healthiness 1.7)</li>
<li> Dark Blackout organic chocolate (Healthiness 2.3) instead of Snickers candy bars (Healthiness -0.16) </li>
<li> Artesano Style Bread (Healthiness 2.1) instead of Bread, Country Buttermilk (Healthiness 1.8) </li>
</ul>
<p class="lead mb-5"> Do you want some more examples? Have a look at our GitHub (link at the bottom of the page) and check the recommendations for yourself! </p>
</div>
</div>
</div>
<div class="container text-center">
<a class="btn btn-dark btn-xl js-scroll-trigger" href="#conclusion">Conclusion</a>
</div>
</section>
<section class="content-section bg-primary text-white" id="conclusion">
<div class="container text-justify">
<div class="row">
<div class="col-lg-10 mx-auto">
<h2 class="mb-4">Conclusion</h2>
<p class="lead mb-5"> This project demonstrated the power of data analysis.
Without any a priori information about the healthiness of products, we managed, with rather simple calculations,
to advise relevant food choices that would increase both healthiness and the probability that the user is satisfied
with the recommended product. Using a homemade healthiness score, a popularity score and the proximity between products in the graph, we obtained rather
good results without any particularly advanced technology. On average, the gain in healthiness is more than 1.3 (on a scale from -1 to 4).</p>
<p class="lead mb-5"> The limits of this model are linked to the strong assumption we made about user habits (which should be valid with such a big dataset):
that people's choices across aisles are as healthy as their choices within an aisle.
Several bad predictions were made on products that are consumed by very few people, who individually have no reason to satisfy this assumption.
"There is no data like more data" once again describes the easiest and most efficient way to improve our results: the more users, the more accurate the assumption.
More data would also increase the reliability of the popularity score. We are facing here the sparsity of the user data
(each user being a point and each product being a feature). Another important improvement would be to add a notion of substitutability
between products, to avoid advising people to replace apples with avocados!</p>
</div>
</div>
</div>
</section>
<!-- Footer -->
<footer class="footer text-center">
<div class="container">
<ul class="list-inline mb-5">
<li class="list-inline-item">
<a class="social-link rounded-circle text-white" href="https://github.com/auriane81/epfl-ada-ada-2019-project-adax">
<i class="icon-social-github"></i>
</a>
</li>
</ul>
<p class="text-muted small mb-0">Copyright © Auriane Cozic, Ariane Delrocq, Eloi Littner, Pierre Liorit a.k.a EPFL Adax Team 2019</p>
</div>
</footer>
<!-- Scroll to Top Button-->
<a class="scroll-to-top rounded js-scroll-trigger" href="#page-top">
<i class="fas fa-angle-up"></i>
</a>
<!-- Bootstrap core JavaScript -->
<script src="vendor/jquery/jquery.min.js"></script>
<script src="vendor/bootstrap/js/bootstrap.bundle.min.js"></script>
<!-- Plugin JavaScript -->
<script src="vendor/jquery-easing/jquery.easing.min.js"></script>
<!-- Custom scripts for this template -->
<script src="js/stylish-portfolio.min.js"></script>
</body>
</html>