forked from hadley/r4ds
-
Notifications
You must be signed in to change notification settings - Fork 10
/
datetimes.Rmd
609 lines (426 loc) · 22.4 KB
/
datetimes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
# Dates and times
<!-- https://creativecommons.org/2020/04/21/academic-publications-under-no-derivatives-licenses-is-misguided/ -->
## Introduction
This chapter will show you how to work with dates and times in pandas. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:
* Does every year have 365 days?
* Does every day have 24 hours?
* Does every minute have 60 seconds?
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year? (It has three parts.) You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25. You might not have known that some minutes have 61 seconds because every now and then leap seconds are added because the Earth's rotation is gradually slowing down.
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
### Prerequisites
This chapter will focus on the [pandas time series functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) and the [Timestamp](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html) methods in pandas, which makes it easier to work with dates and times in Python. We will also need nycflights13 for practice data.
```{python setup, message = FALSE, cache = FALSE}
import pandas as pd
import altair as alt
import numpy as np
import datetime
alt.data_transformers.enable('json')
flights = pd.read_csv("https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/flights/flights.csv")
```
## Creating date/times
pandas has two primary types of date/time data that refers to an instant in time:
* A __datetime__ is a date plus a time: it uniquely identifies an
instant in time (typically to the nearest second). pandas reports these types
as `datetime64`. The NumPy team established this native time series data type.
Being encoded as 64-bit integers allows arrays of dates to be represented very
compactly [ref](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html).
* A __Timestamp__ is a subclass of a datetime.datetime object. This pandas class
provides the speed and ease of use in handling complex time handling.
In this chapter we are only going to focus on dates and date-times. We will focus on the pandas tools for dates and times and use the `datetime` python module for some elements of handling time in Python.
To get the current date or date-time you can use `date.today()` or `datetime.now()`:
```{python}
# using datetime
datetime.datetime.today()
datetime.datetime.now()
datetime.datetime.utcnow()
# using pandas
pd.to_datetime('today')
pd.to_datetime('now')
```
Otherwise, there are three ways you're likely to create a date/time:
* From a string.
* From individual date-time components.
* From an existing date/time object.
They work as follows.
### From strings
Date/time data often comes as strings. You've seen one approach to parsing strings into date-times in [date-times](#readr-datetimes). Another approach is to use the helpers provided by pandas. They can work out the format of unambiguous dates. For example:
```{python}
pd.to_datetime("2017-01-31")
pd.to_datetime("January 31st, 2017")
pd.to_datetime("31-Jan-2017")
# using timestamp
pd.Timestamp("2017-01-31")
pd.Timestamp("January 31st, 2017")
pd.Timestamp("31-Jan-2017")
```
You can specify component order for ambiguous dates with `pd.to_datetime()`. For example:
```{python}
pd.to_datetime('10-11-09')
pd.to_datetime('10-11-09', yearfirst=True)
pd.to_datetime('10-11-09', dayfirst=True)
pd.to_datetime('10-11-09', format = '%d-%y-%m')
```
If you use numbers in `pd.to_datetime()` it is interpreted as the number of nanoseconds from 1970-01-01. However, the unit can be changed.
```{python}
pd.to_datetime(100000000)
pd.to_datetime(100000000, unit = 'ns')
pd.to_datetime(1, unit = 's')
```
To input time into a datetime add the `hh:mm:ss` format into the string of the parsing function:
```{python}
pd.to_datetime("2017-01-31 20:11:59")
pd.to_datetime("01/31/2017 08:01")
pd.Timestamp("2017-01-31 20:11:59")
pd.Timestamp("01/31/2017 08:01")
```
### From individual components
Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:
```{python}
flights.filter(['year', 'month', 'day', 'hour', 'minute']).head()
```
To create a date/time from this sort of input, use `Timestamp()`:
```{python}
(flights.
filter(['year', 'month', 'day', 'hour', 'minute']).
assign(departure = pd.to_datetime(flights[['year', 'month', 'day', 'hour', 'minute']]))
)
```
Let's do the same thing for each of the four time columns in `flights`. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I've created the date-time variables, I focus in on the variables we'll explore in the rest of the chapter.
```{python, cache = FALSE}
# parsing function
def make_datetime_100(data, time):
date_data = (data[['year', 'month', 'day', time]].
assign(
hour = lambda x: x[time] // 100,
minute = lambda x: x[time] % 100).
filter(['year', 'month', 'day', 'hour', 'minute'])
)
out = pd.to_datetime(date_data)
return out
# data table
flights_dt = (flights.
query('(dep_time.notna()) & (arr_time.notna())').
assign(
dep_time = lambda x: make_datetime_100(x, 'dep_time'),
arr_time = lambda x: make_datetime_100(x, 'arr_time'),
sched_dep_time = lambda x: make_datetime_100(x, 'sched_dep_time'),
sched_arr_time = lambda x: make_datetime_100(x, 'sched_arr_time')
).
filter( regex = 'origin|dest|delay$|time$')
)
flights_dt
```
With this data, I can visualise the distribution of departure times across the year:
```{python}
dat_plot = (flights_dt.
filter(['dep_time']).
assign(dpd = flights_dt.dep_time.dt.floor("D")).
groupby(['dpd'], as_index = False).
count())
chart = (alt.Chart(dat_plot).
encode(x = 'dpd', y = 'dep_time').
mark_line())
chart.save("screenshots/altair_datetimes_1.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_1.png")
```
Or within a single day:
```{python}
dat_plot = (flights_dt.filter(['dep_time']).query("dep_time < 20130102").
assign(dpd = flights_dt.dep_time.dt.floor("10min")).
groupby(['dpd'], as_index = False).
count())
chart = (alt.Chart(dat_plot).
encode(x = 'dpd', y = 'dep_time').
mark_line())
chart.save("screenshots/altair_datetimes_2.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_2.png")
```
### From other types
Sometimes you'll get date/times as numeric offsets from the "Unix Epoch", 1970-01-01. If the offset is in seconds, use `as_datetime()`; if it's in days, use `as_date()`.
```{python}
pd.to_datetime(60 * 60 * 10, unit = 's')
pd.to_datetime(365 * 10 + 2, unit = 'd')
```
### Exercises
1. What happens if you parse a string that contains invalid dates?
```{python, eval = FALSE}
pd.to_datetime(['2010-10-10', 'bananas'])
pd.to_datetime(['2010-10-10', 'bananas'], errors = 'coerce')
```
## Date-time components
Now that you know how to get date-time data into Python's datetime data structures, let's explore what you can do with them. This section will focus on the [Timestamp properites](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components) that let you get and set individual components. The next section will look at how arithmetic works with datetimes.
### Getting components
You can pull out individual parts of the date with the accessor properties `.year`, `.month`, `.day()` (day of the month), `.dayofyear` (day of the year), `.dayofweek` (day of the week), `.hour`, `.minute`, and `.second`. When using these properties on a pandas series you must include `.dt.` as in `<PANDAS SERIES>.dt.month`.
```{python}
datetime = pd.to_datetime("2016-07-08 12:34:56")
datetime.year
datetime.month
datetime.day
datetime.dayofyear
datetime.dayofweek
```
To return the name of the month or day of the week use the two name methods `.day_name()` and `month_name()`.
```{python}
datetime.month_name()
datetime.day_name()
```
We can use `.day_name()` in pandas to create a new variable for your chart or use `day()` in Altair to build the date variable only for the chart using the [TimeUnit transforms](https://altair-viz.github.io/user_guide/transform/timeunit.html) to see that more flights depart during the week than on the weekend:
```{python}
flights_dt.dep_time.dt.day_name()
chart = (alt.Chart(flights_dt).
encode(alt.X('day(dep_time):O'), alt.Y('count()')).
mark_bar().
properties(width = 400))
chart.save("screenshots/altair_datetimes_3.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_3.png")
```
There's an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
```{python}
plot_dat = (flights_dt.assign(
minute = lambda x: x.dep_time.dt.minute).
groupby('minute').
agg(
avg_delay = ('arr_delay', np.mean),
n = ('arr_delay', 'size')
).reset_index()
)
chart = (alt.Chart(plot_dat).
encode(alt.X('minute'), alt.Y('avg_delay')).
mark_line())
chart.save("screenshots/altair_datetimes_4.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_4.png")
```
Interestingly, if we look at the _scheduled_ departure time we don't see such a strong pattern:
```{python}
sched_dep = (flights_dt.assign(
minute = lambda x: x.sched_dep_time.dt.minute).
groupby('minute').
agg(
avg_delay = ('arr_delay', np.mean),
n = ('arr_delay', 'size')
).reset_index()
)
chart = (alt.Chart(sched_dep).
encode(alt.X('minute'), alt.Y('avg_delay')).
mark_line())
chart.save("screenshots/altair_datetimes_5.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_5.png")
```
So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!
```{python}
chart = (alt.Chart(sched_dep).
encode(alt.X('minute'), alt.Y('n')).
mark_line())
chart.save("screenshots/altair_datetimes_6.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_6.png")
```
### Rounding
An alternative approach to plotting individual components is to round the date to a nearby unit of time, with `.dt.floor()`, `.dt.round()`, and `.dt.ceiling()`. However, these methods don't work well for coarser roundings. For example, finding the floor of week does not work using `dt.floor()` and the recommended approach is to use `.dt.to_period()` in tandem with `.dt.to_timestep()` as we do in the following example. Each method takes a Series of datetimes to adjust and then the name of the unit round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases
```{python}
plot_dat = flights_dt.assign(
week = lambda x: x.dep_time.dt.to_period("W").dt.to_timestamp()
).groupby('week').agg(
n = ('dep_time', 'size')
).reset_index()
chart = (alt.Chart(plot_dat).
encode(alt.X('week'), alt.Y('n')).
mark_line())
chart.save("screenshots/altair_datetimes_7.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_7.png")
```
Computing the difference between a rounded and unrounded date can be particularly useful.
### Setting components
You can also use each accessor function to set the components of a date/time:
```{python}
datetime = pd.Timestamp("2016-07-08 12:34:56")
datetime.replace(year = 2020)
datetime.replace(month = 1)
datetime.replace(hour = datetime.hour + 1)
```
Alternatively, Rather than modifying each separately, you can create a new date-time with `replace()` using all of the arguments. This allows you to set multiple values at once.
```{python}
datetime.replace(year = 2020, month = 2, day = 2, hour = 2)
```
If values are too big, you will see an error:
```{python, error=TRUE}
pd.Timestamp("2015-02-01").replace(day = 30)
```
You can use `replace()` to show the distribution of flights across the course of the day for every day of the year:
__Note:__ The `Timestamp.replace()` method does not have an equivalent method that can be used on pandas Series so you will need to use an `apply()` method on the datetime series. We have avoided the use of `apply()` in this book up to this point. `apply()` is a powerful technique within pandas that is worth your time to master. You can read [the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) to find additional help.
```{python}
plot_dat = flights_dt.assign(
dep_hour = lambda x: x.dep_time.apply(lambda y: y.replace(month = 1, day = 1))
)
chart = (alt.Chart(plot_dat).
encode(alt.X('dep_hour'), alt.Y('count()')).
mark_line())
chart.save("screenshots/altair_datetimes_8.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_datetimes_8.png")
```
Setting larger components of a date to a constant is a powerful technique that allows you to explore patterns in the smaller components.
### Exercises
1. How does the distribution of flight times within a day change over the
course of the year?
1. Compare `dep_time`, `sched_dep_time` and `dep_delay`. Are they consistent?
Explain your findings.
1. Compare `air_time` with the duration between the departure and arrival.
Explain your findings. (Hint: consider the location of the airport.)
1. How does the average delay time change over the course of a day?
Should you use `dep_time` or `sched_dep_time`? Why?
1. On what day of the week should you leave if you want to minimise the
chance of a delay?
1. What makes the distribution of `diamonds.carat` and
`flights.sched_dep_time` similar?
1. Confirm my hypothesis that the early departures of flights in minutes
20-30 and 50-60 are caused by scheduled flights that leave early.
Hint: create a binary variable that tells you whether or not a flight
was delayed.
## Time spans (Deltas)
Next you'll learn about how arithmetic with dates work, including subtraction, addition, and division. Along the way, you'll learn about pandas tools for representing time spans. In Python, when you subtract two dates, you get a Timedelta object:
```{python}
# How old is Hathaway?
h_age = pd.Timestamp.now() - pd.Timestamp("19770907")
h_age
```
A Timedelta object records a time span in nanoseconds using 64 bit integers which can make Timedeltas a little painful to work with, so Timedelta provides access to the number of days with `.days` and the number of seconds with `.total_seconds()`.
```{Python}
h_age.days
h_age.days // 365 # to get years
h_age.total_seconds()
```
Timedeltas come with a bunch of convenient constructors:
```{python}
pd.Timedelta(seconds = 15)
pd.Timedelta(minutes = 10)
pd.Timedelta(hours = 12)
pd.Timedelta(hours = 24)
pd.Timedelta(days = 5)
pd.Timedelta(weeks = 3)
# to create a series of Timedeltas you can use a for loop.
pd.Series([pd.Timedelta(days=i) for i in range(5)])
```
```{python}
2*pd.Timedelta(days = 365)
pd.Timedelta(days = 365) + pd.Timedelta(weeks = 12) + pd.Timedelta(hours = 15)
```
You can add and subtract Timedeltas to and from days:
```{python}
tomorrow = pd.Timestamp.today() + pd.Timedelta(days = 1)
last_year = pd.Timestamp.today() - pd.Timedelta(days = 365)
```
And of course, add them to dates. Timedeltas are strict durations based on the arguments. Those durations do work across the various novelties of daylight savings time and leaps in the calendar like February 29th.
```{python}
# A leap year
pd.Timestamp("2020-02-01") + pd.Timedelta(days = 30)
# Not a leap year
pd.Timestamp("2019-02-01") + pd.Timedelta(days = 30)
# Daylight Savings Time
one_pm = pd.Timestamp('2016-03-12 13:00:00', tz = 'America/New_York')
# Day two 1pm
one_pm_d2 = pd.Timestamp('2016-03-13 13:00:00', tz = 'America/New_York')
one_pm + pd.Timedelta(days = 1)
one_pm + pd.Timedelta(hours = 24)
# Notice that it correctly represents that only 23 hours had passed.
one_pm_d2 - one_pm
```
Let's use Timedeltas to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination _before_ they departed from New York City.
```{python}
flights_dt.query('arr_time < dep_time')
```
These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding `days(1)` to the arrival time of each overnight flight.
```{python, cache = FALSE}
flights_dt = flights_dt.assign(
overnight_TF = lambda x: x.arr_time < x.dep_time,
overnight = lambda x: x.overnight_TF.astype('int').apply(lambda y: pd.Timedelta(days = y)),
arr_time = lambda x: x.arr_time + x.overnight,
sched_arr_time = lambda x: x.sched_arr_time + x.overnight
)
```
Now all of our flights obey the laws of physics.
```{python}
flights_dt.query('arr_time < dep_time')
```
## Time zones
Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, The Python package [pytz](https://pypi.org/project/pytz/) uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland". Pandas leverages the pytz package to handle time zones.
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades worth of time zone rules. In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that name needs to reflect not only to the current behaviour, but also the complete history. For example, there are time zones for both "America/New_York" and "America/Detroit". These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located), did not follow DST, so it needs a different name. It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
You can find out what Python thinks your current time zone is with `Sys.timezone()`:
```{python}
import time
time.tzname
```
To see the complete list of all time zone Olson names for a country use `pytz.country_timezones()`:
```{python}
import pytz
pytz.country_timezones('US')
# pytz.all_timezones()
```
In Python, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:
```{python}
x1 = pd.Timestamp("2015-06-01 12:00:00", tz = 'America/New_York')
x2 = pd.Timestamp("2015-06-01 18:00:00", tz = 'Europe/Copenhagen')
x3 = pd.Timestamp("2015-06-01 04:00:00", tz = 'Pacific/Auckland')
```
You can verify that they're the same time by converting to UTC and comparing the offsets
```{python}
x1.tz_convert('utc')
x2.tz_convert('utc')
x3.tz_convert('utc')
```
You can't do subtraction with Timestamps of different time zones:
```{python, error=TRUE}
x1 - x2
x1 - x3
```
But Timedeltas can be calculated after using `.tz_convert()`:
```{python}
x1.tz_convert('utc') - x2.tz_convert('utc')
```
Unless otherwise specified, `pd.Timestampe()` is not timezone aware. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times into a Series will often convert the time zone to UTC and then return the Series as an object or character string.
```{python}
x4 = pd.Series([x1, x2, x3])
x4
```
Combining elements that all have the same time zone will return a datetime object with that specified time zone:
```{python}
x5 = pd.Series([x1.tz_convert('utc'), x2.tz_convert('utc'), x3.tz_convert('utc')])
x5
```
Finally, Timestamp objects that don't have a specified time zone will state ase datetime objects:
```{python}
pd.Series([pd.Timestamp("2015-06-01 04:00:00"), pd.Timestamp("2015-06-01 04:00:00")])
```
You can change the time zone in two ways:
* Keep the instant in time the same, and change how it's displayed.
Use this when the instant is correct, but you want a more natural
display.
```{python}
x5a = x5.dt.tz_convert('Australia/Lord_Howe')
x5a
```
(This also illustrates another challenge of times zones: they're not
all integer hour offsets!)
* Change the underlying instant in time. Use `None` to strip the time zone
when you have an instant that has been labelled with the incorrect time
zone, and you need to fix it.
```{python}
x5b = x5.dt.tz_localize(None).dt.tz_localize('Australia/Lord_Howe')
x5b
```