-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.html
571 lines (435 loc) · 19.2 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title>Command-line wizardry for CSV, JSON, HTML</title>
<link rel="stylesheet" href="revealjs/dist/reset.css">
<link rel="stylesheet" href="revealjs/dist/reveal.css">
<link rel="stylesheet" href="revealjs/dist/theme/moon.css">
<!-- Theme used for syntax highlighting of code -->
<link rel="stylesheet" href="https://highlightjs.org/static/demo/styles/atom-one-light.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<section data-markdown><textarea data-template>
## CLI Tools for CSV, JSON, & HTML
_How to be lazy & avoid writing programs to process data_
</textarea></section>
<section data-markdown><textarea data-template>
## Overview
Together, we will learn some things:
- Why more code isn't the answer
- Why you may not need code in the first place
- UNIX philosophy and why it _is_ the answer
- Some new tools and how to use them
- A taste of what is possible
</textarea> </section>
<section data-markdown><textarea data-template>
## Challenge!
- Take this giant CSV from the client!
- Get a bunch of related data from remote sources!
- Do a bunch of processing on the contents!
- Format and save the results in a CSV to send back!
</textarea></section>
<section data-markdown><textarea data-template>
### Meanwhile, I'm like...

</textarea></section>
<section data-markdown><textarea data-template>
### Oh, you need it by EOD?
</textarea></section>
<section data-markdown><textarea data-template>
### Put on a Happy Face

</textarea></section>
<section data-markdown><textarea data-template>
## Some Context...
- I work with people who need data to do their jobs
- Many requests are unique one-offs
- Without more tooling, I need to write custom programs
</textarea></section>
<section>
<img src="./img/frustrated.gif" alt="This makes developers sad">
</section>
<section data-markdown><textarea data-template>
## But what if you don't know how to program to begin with?
Everyone should learn to program if they want. But these kinds of
problems shouldn't be the reason you learn to program.
The tools already exist on your computer.
Save programming for more interesting problems.
</textarea></section>
<section data-markdown><textarea data-template>
## [Mike Gancarz: The UNIX Philosophy](https://en.wikipedia.org/wiki/Unix_philosophy#Mike_Gancarz:_The_UNIX_Philosophy)
- Small is beautiful.
- Make each program do one thing well.
- Build a prototype as soon as possible.
- Choose portability over efficiency.
- Store data in flat text files.
- Use software leverage to your advantage.
- Use shell scripts to increase leverage and portability.
- Avoid captive user interfaces.
- Make every program a filter.
</textarea></section>
<section data-markdown><textarea data-template>
## Moral of the Story
Don't write programs to solve these problems.
Compose existing programs. Create solutions.
</textarea></section>
<section data-markdown><textarea data-template>

- If you're not already sure, find out what terminal program to use on
your computer
- Macs have Terminal.app already installed
- Check out [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
- These examples will use `bash` -- a "shell" used to send commands
and run CLI programs on your computer
</textarea></section>
<section>
<section data-markdown><textarea data-template>
## Tools in Your Toolbelt
MacOS and Linux machines come with a standard set of tools inspired by
the original UNIX systems:
- `cat`: print the contents of a file
- `grep`: find content in a file based on patterns
- `awk`: process text streams with filters, extractors, and reporters
- `curl`: send and process HTTP requests
- `uniq`: remove duplicates or group and count lines
- `sort`: sort files
</textarea></section>
<section data-markdown><textarea data-template>
## So Much Stuff!
There are many tools out there and it takes time to learn. Just focus
on the few that solve most of your problems now, and keep learning:
- ["10 command-line tools for data analysis in Linux"](https://opensource.com/article/17/2/command-line-tools-data-analysis-linux)
- ["7 Command Line Tools Every New Web Developer Should Know"](https://suncoast.io/blog/command-lines-web-developer/)
- ["Master the command line: How to use man pages"](https://www.macworld.com/article/2044790/master-the-command-line-how-to-use-man-pages.html)
</textarea></section>
</section>
<section>
<section data-markdown><textarea data-template>
## Elegant BASH for a More Civilized Age
The computer shell lets us work with files and programs. But more
importantly, it lets us "connect" programs together into long chains.
The standard output (stoud) of one program can become the standard input
(stdin) of another via "pipes".
</textarea></section>
<section data-markdown><textarea data-template>
## Examples
- [`examples/bash/parse-server-logs.sh`](examples/bash/parse-server-logs.sh)
- [`examples/bash/popular-commands.sh`](examples/bash/popular-commands.sh)
</textarea></section>
</section>
<section data-markdown><textarea data-template>
## "Buy Nothing"
<small>
There is a whole world of tools available to solve all kinds of new
problems. And many are free and available on most systems via package
managers, binaries, or source.
</small>
- [jq](https://stedolan.github.io/jq/) - Parse, transform, and emit JSON
from files or stdin
- [pup](https://github.com/ericchiang/pup) - Parse and extract from HTML in a file or stdin
- [csvkit](https://csvkit.readthedocs.io/en/1.0.1/) - A suite of tools for working
with data read from or written to CSV files
_There are other nominees--like [sgrep](http://sgrep.sourceforge.net/)
and [RecordStream](https://github.com/benbernard/RecordStream)--but
we'll stay focused on these 3 for now_
</textarea></section>
</section>
<section>
<section data-markdown><textarea data-template>
## JQ
#### Given:
- Some input via stdin or a file
- Some selectors, filters, transformations, and expressions
#### Output:
- A JSON value in the object
- A sub-object in the input or an entirely new object
- The result of an expression or calculation
</textarea></section>
<section data-markdown><textarea data-template>
### Getting Values Out
```bash
$ echo '{"foo": "bar"}' | jq '.foo'
> "bar"
$ echo '{"stats": { "hp": 20, "mp": 30 }}' | jq '.stats'
> {
"hp": 20,
"mp": 30
}
$ echo '{"stats": { "hp": 20, "mp": 30 }}' | jq '.stats.mp'
> 30
$ echo '{"name": "David", "age": "32", "isADavid": true}' \
| jq -c '{name,isADavid}'
> {"name":"David","isADavid":true}
```
</textarea></section>
<section data-markdown><textarea data-template>
### Expressions on Values
```bash
$ echo '[1,2,3,4,5]' | jq 'add'
> 15
$ echo '[
{"name": "David"},
{"name": "Micaline"},
{"name": "Porter"},
{"name": "Isaac"},
{"name": "Wesley"}
]' | jq -c 'map(.name | length)'
> [5,8,6,5,6]
```
</textarea></section>
<section data-markdown><textarea data-template>
### Another Example
[`examples/jq/extract/stationcode-weather.sh`](examples/jq/extract/stationcode-weather.sh)
</textarea></section>
</section>
<section>
<section data-markdown><textarea data-template>
## Pup
- Basically jq, but for HTML
- Uses CSS selectors as a language for extraction
- Not as much power in transformations and expressions
- It just pulls things out of web pages
</textarea></section>
<section data-markdown><textarea data-template>
### Examples
A hastily borrowed example: HN article titles
```bash
curl -s https://hckrnews.com/ \
| pup '#entries .entries .entry a:last-child json{}'
```
<small>_Hint: this looks like a job for jq_</small>
A more relevant example: Extracting structured JSON
```bash
curl -s https://developers.google.com/search/docs/guides/intro-structured-data \
| pup 'script[type="application/ld+json"] text{}' \
| jq '.itemListElement | .[].item'
```
</textarea></section>
</section>
<section>
<section data-markdown><textarea data-template>
## CSVKit
- Do common CSV tasks on the command-line
- Builtin support for reading `gzip` or `bz2`
- Built in Python, distributed via `pip`
- _No programming required!_
</textarea></section>
<section data-markdown><textarea data-template>
### csvsql and sql2csv
- Read from database; produce CSV
- Generate `INSERT` statements from CSV contents
- Supports various databases
- Be careful on implementation-specific types (dates, times, lists, etc.)
```bash
$ sql2csv --db postgresql:///database \
--query "select * from data" > new.csv
$ csvsql --db postgresql:///database \
--insert data.csv
$ csvsql --help | grep "\-\-dialect"
> --dialect {access,sybase,sqlite,informix,firebird,mysql,
oracle, maxdb,postgresql,mssql}
```
</textarea></section>
<section data-markdown><textarea data-template>
### csvjson
- Produce JSON from CSV contents
- Column names become object field properties
- Useful options
- `--stream`: Emit one row at a time
- `-p`: Escape character
- `--maxfield`: Increase default column width
</textarea></section>
<section data-markdown><textarea data-template>
### csvjson
Have a program that takes in JSON, but you have a CSV? Have JSON stored
in CSVs or DB columns?
Send them to csv json to turn column headers into JSON fields. Each row
becomes an object.
```bash
csvjson --stream \
| parallel -N1 "echo {} | curl https://myapi.com/data -XPOST -H 'Content-Type: application/json' -d @-"
```
See [examples/csvkit/csvjson/convert-to-json.sh](examples/csvkit/csvjson/convert-to-json.sh)
</textarea></section>
<section data-markdown><textarea data-template>
### in2csv
- Produce CSV from structured data
- _Warning_ - Not smart about nested, list data types
- Understands a few formats:
- xls,xlsx - _Some people still use these, right?_
- json - _Turn a single document into a CSV...meh_
- ndjson - _Hey, jq knows how to produce this!_
```bash
./stationcode-weather.sh | in2csv -f ndjson > weather-report.csv
```
</textarea></section>
<section data-markdown><textarea data-template>
### csvcut
- Filter and truncate CSV files
- Useful for dropping columns you don't need
- Not much else to say here...
```bash
csvcut -c county,item_name,quantity data.csv
```
</textarea></section data-markdown>
<section data-markdown><textarea data-template>
### csvjoin
- Join multiple files by common columns and values
- Supports various join strategies (inner, left, right)
- _Warning_ - Limitations on very large files (join happens in memory)
```bash
# item-detail.csv, item-prices.csv
csvjoin --left -c item_id,item_id item-detail.csv item-prices.csv \
| csvjson --stream \
| gzip \
> $OUTPUT
```
</textarea></section data-markdown>
<section data-markdown><textarea data-template>
### csvstack
- Combine homogenous data spread across multiple files
```bash
csvstack \
log-export-1.csv \
log-export-2.csv \
log-export-3.csv \
> combined.csv
```
</textarea></section data-markdown>
</section>
<section>
<section data-markdown><textarea data-template>
## Putting it all Together
[Average Dragonball Z Power Levels](http://dragonball.wikia.com/wiki/List_of_power_levels#Power_levels)
<small><em>
This is a list of known and official power levels in the Dragon Ball
universe. All of the levels on this list are taken from the manga,
anime, movies, movie pamphlets, Daizenshuu guides, video games and
stated mathematical calculations.
</em></small>

</textarea></section data-markdown>
<section data-markdown><textarea data-template>
### The Steps
1. `curl` - Download the raw HTML
2. `pup` - Extract the rows with the data we want
3. `jq` - Extract values from the DOM and calculate statistics
4. `in2csv` - Write the results to a final report
See [`examples/dbz/`](examples/dbz/)
</textarea></section data-markdown>
</section>
<section>
<section data-markdown><textarea data-template>
## Tools ▶ Composition ▶ Solutions
Sometimes these are problems you'll want to solve on a regular basis.
You can save these commands to files and give them names:
```bash
#!/bin/bash
# Save to "extract-address.sh"
curl -s $1 \
| pup 'script[type="application/ld+json"] text{}' \
# We get their contents as HTML-escaped text, so we
# convert the quotes back
| sed 's/"/"/g' \
# The '-c' flag tells jq to flatten the result to 1 line
| jq -c '({ name, url, telephone } + .address) | del(.["@type"])'
```
</textarea></section>
<section data-markdown><textarea data-template>
### Automate Dependencies and Task Runners
Find tools that can run commands, skip things that are already
done, and run everything in parallel
</textarea></section>
<section data-markdown><textarea data-template>
### Drake
_Data workflow tool, like a "Make for data"_
- [Factual/drake](https://github.com/Factual/drake) analyzes
dependencies and composes building blocks to orchestrate processing
- Dependency analysis to produce a result
- Parallelizes work and maximizes resource utilization
- Commands and programs to produce targets
- All the tools we discussed--plus many others--can be composed to
define bigger building blocks
</textarea></section>
<section data-markdown><textarea data-template>
### Drake Examples
Filter and emit simple list of unique users
```bash
; Generate unique list of users with usage information
managed_users.jsonl <- db.jsonl.gz
zcat $INPUT | jq -c '{user: .user}' | uniq > $OUTPUT
if [ $PIPESTATUS -eq 1 ]; then
echo $INPUTS >> $[ERR_FILE]; rm $OUTPUT && exit 1;
fi
```
</textarea></section>
<section data-markdown><textarea data-template>
### Drake Examples
Combine raw inputs and insert into a single table
```bash
%db-products <- products.csv.gz, product-details.csv.gz
csvjoin -c product_id,product_id $INPUT0 $INPUT1 \
| csvcut -C product_id \
| csvsql --table products --insert --db \
postgresql://postgres:$[PG_PASS]@$[PG_HOST]:$[PG_PORT]/products
```
</textarea></section>
<section data-markdown><textarea data-template>
### Drake Examples
```bash
sites-to-process.jsonl.gz <- %base
sql2csv --db '$[DB_CONN]' \
src/sql/sites-by-processing-state.sql 2>> $[ERR_FILE] \
| csvjson --stream \
| gzip \
> $OUTPUT
if [ $PIPESTATUS -eq 1 ]; then
echo $INPUTS >> $[ERR_FILE]; rm $OUTPUT && exit 1;
fi
```
</textarea></section>
</section>
<section data-markdown><textarea data-template>
## Limitations and Downsides
- Shell scripting can messy quickly
- Long-lived problems means you can't turn your laptop off
- If the problem is sufficiently complex, you'll eventually want the
flexibility of a custom program
</textarea></section>
<section data-markdown><textarea data-template>
## Takeaways
By now you should have:
- New ways to be lazy
- A healthy distaste for "the hard way"
- A passion for simple tools and composition
- An awareness of a few tools to help you crunch data
- A frustration with me for making you look at Dragonball Z
</textarea></section>
<section data-markdown><textarea data-template>
## Presentation Complete. Questions?
Go forth and do the datas.

</textarea></section>
</div>
</div>
<script src="revealjs/dist/reveal.js"></script>
<script src="revealjs/plugin/markdown/markdown.js"></script>
<script src="revealjs/plugin/highlight/highlight.js"></script>
<script>
let deck = new Reveal({
plugins: [ RevealMarkdown, RevealHighlight ]
})
deck.initialize({
history: true,
slideNumber: true,
previewLinks: true,
transition: 'slide', // none/fade/slide/convex/concave/zoom
backgroundTransition: 'fade', // none/fade/slide/convex/concave/zoom
});
</script>
</body>
</html>