Add `--optimized` mode #55

igorbrigadir · 2022-11-07T16:24:22Z

The current output favors preserving as much information as possible from the original JSON, but there is some duplication, and a number of columns are rarely useful and can be removed.

The new `--optimized` mode will generate CSVs that drop the following columns to save space:

edit_controls.edits_remaining
edit_controls.editable_until
entities.cashtags
entities.hashtags
entities.mentions
withheld.scope
withheld.copyright
author.id
author.entities.description.cashtags
author.entities.description.hashtags
author.entities.description.mentions
author.url
author.withheld.scope
author.withheld.copyright
geo.coordinates.coordinates
geo.coordinates.type
geo.country
geo.full_name
geo.geo.type
matching_rules
__twarc.retrieved_at
__twarc.url
__twarc.version

(exact list to be revised later)
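To make the idea concrete, here is a minimal sketch of the column drop as a post-processing step on an existing CSV. It is not how the `--optimized` flag is (or will be) implemented; `DROPPED_COLUMNS` and `write_optimized` are hypothetical names used only for illustration, and the column list is the provisional one above.

```python
import csv
import io

# Provisional list copied from this issue; expected to change.
DROPPED_COLUMNS = {
    "edit_controls.edits_remaining", "edit_controls.editable_until",
    "entities.cashtags", "entities.hashtags", "entities.mentions",
    "withheld.scope", "withheld.copyright",
    "author.id",
    "author.entities.description.cashtags",
    "author.entities.description.hashtags",
    "author.entities.description.mentions",
    "author.url", "author.withheld.scope", "author.withheld.copyright",
    "geo.coordinates.coordinates", "geo.coordinates.type",
    "geo.country", "geo.full_name", "geo.geo.type",
    "matching_rules",
    "__twarc.retrieved_at", "__twarc.url", "__twarc.version",
}


def write_optimized(src, dst, dropped=DROPPED_COLUMNS):
    """Stream CSV rows from src to dst, omitting the columns in `dropped`.

    Hypothetical helper: rows are copied one at a time, so the whole
    file never has to fit in memory. Returns the kept column names.
    """
    reader = csv.DictReader(src)
    kept = [name for name in (reader.fieldnames or []) if name not in dropped]
    # extrasaction="ignore" silently skips the dropped keys in each row.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# Small in-memory demonstration with made-up data.
source = io.StringIO("id,text,author.id,__twarc.version\n1,hello,42,0.1\n")
target = io.StringIO()
kept_columns = write_optimized(source, target)
print(target.getvalue())
```

With real files, the same function can be called on handles opened with `newline=""`, as the `csv` module documentation recommends.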

These are the columns that are most often empty or duplicated. The missing data can be inferred from the remaining columns or, for cashtags, hashtags, and mentions, re-extracted from the tweet text with twitter-text, for example.
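As a rough illustration of re-extracting entities from the text column, the sketch below uses simplified regular expressions. These are assumptions made for the example, not the twitter-text rules: twitter-text handles Unicode, URLs, and many edge cases that these patterns ignore, so treat the output as an approximation. `extract_entities` is a hypothetical helper name.

```python
import re

# Simplified patterns (assumptions for illustration, NOT the twitter-text rules).
# A hashtag needs at least one letter, so "#123" is not matched.
HASHTAG = re.compile(r"(?<![\w&])#(\w*[^\W\d_]\w*)")
# The lookbehind rejects "@" inside e-mail addresses such as me@example.com.
MENTION = re.compile(r"(?<![\w@])@(\w{1,15})")
# Cashtags are approximated as "$" followed by 1 to 6 ASCII letters.
CASHTAG = re.compile(r"(?<![\w$])\$([A-Za-z]{1,6})\b")


def extract_entities(text):
    """Return approximate hashtags, mentions, and cashtags found in `text`."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "cashtags": CASHTAG.findall(text),
    }


sample = "Thanks @edsu! #twarc and #csv beat $TWTR, mail me@example.com #123"
print(extract_entities(sample))
```

Anything that depends on exact parity with the API's entity fields should use twitter-text itself rather than patterns like these.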

Should probably fix #36 and #47 first before this.


edsu · 2022-11-07T16:38:38Z

I'm interested in hearing where the need for this optimization arose. Was it a problem generating the CSV, or reading the generated CSV in another application? It sounds like the latter?

igorbrigadir · 2022-11-07T16:50:15Z

Just trying to deduplicate columns and remove mostly empty ones, so more can fit into memory, and other tools like Great Expectations or pandas-profiling have an easier time.
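One way to check which columns are "mostly empty" in a given dataset is to measure it directly. The sketch below is an illustration only; `empty_fraction` is a hypothetical helper, and what counts as "mostly empty" is left to the reader.

```python
import csv
import io


def empty_fraction(src):
    """Return {column: share of rows whose cell is empty} for a CSV file object.

    Hypothetical helper: streams the file row by row, so it works on
    CSVs that do not fit in memory.
    """
    reader = csv.DictReader(src)
    empty = {name: 0 for name in (reader.fieldnames or [])}
    rows = 0
    for row in reader:
        rows += 1
        for name in empty:
            # Treat missing values and whitespace-only cells as empty.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return {name: (count / rows if rows else 0.0)
            for name, count in empty.items()}


# Small in-memory demonstration with made-up data.
sample = io.StringIO("id,geo.country,withheld.scope\n1,,\n2,IE,\n")
print(empty_fraction(sample))
```

Columns with a share close to 1.0 are natural candidates for the drop list.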

edsu · 2022-11-07T18:15:57Z

Would being able to write to parquet help in situations like that?

igorbrigadir · 2022-11-07T18:26:38Z

Yep! Definitely, I think #29 goes hand in hand with this. I think all of these things are basically the same task for me to do lol

igorbrigadir mentioned this issue Jan 7, 2023

Document working with larger datasets #41

Closed

igorbrigadir added the enhancement New feature or request label Jan 19, 2023
