Skip to content

Latest commit

 

History

History
132 lines (97 loc) · 4.1 KB

README.md

File metadata and controls

132 lines (97 loc) · 4.1 KB

data-diff
data-diff

Data Diff: Compare files with Python

License License

Summary

Using Pandas package to compare two tables with simular schema but different data. VERY DRAFT. Works only with csv files at present.

Example 1:

A simple example comparing two tables with unique ids when using the ID and CODE fields as composite keys:

table A:

ID CODE NAME GROUP VAL
1 A FOO HIGH 5
2 A BAR MED 13
3 B SALSA MED 2

table B:

ID CODE NAME GROUP VAL
1 A FOO HIGH 5
2 A BAR HIGH 15

to compare we run the utility like this:

csv_diff 'c:/data/tableA.csv' 'c:/data/tableB.csv' 
   --output 'c:/data/diff_table.csv' 'ID' 'CODE' --namea 'tableA'
   --nameb 'tableB'

Results in the Data Diff Comparison:

ID CODE NAME GROUP VAL Compare
1 A FOO HIGH 5 tableA == tableB
2 A BAR MED --> HIGH 13 --> 15 tableA <> tableB
3 B SALSA MED 2 --> 8 tableA --> ???

The first record was an exact match, the second had some changes, and the third could only be found in table A.

Example 2:

A simple example of comparing "dirty" data, where the index fields are not quite unique:

table A:

ID CODE NAME GROUP VAL
1 A FOO HIGH 5
1 A FOO MED 13
3 B SAL MED 2
3 B SAL LOW 10

table B:

ID CODE NAME GROUP VAL
1 A FOO HIGH 5
3 B SAL MED 8

to compare we run the utility like this (note --dirty):

csv_diff 'c:/data/tableAdirty.csv' 'c:/data/tableBdirty.csv' 
   --output 'c:/data/diff_table_dirty.csv' 'ID' 'CODE' --namea 'tableA'
   --nameb 'tableB' --dirty

Results in the Data Diff Comparison:

ID CODE NAME GROUP VAL Compare
1 A FOO HIGH 5 tableA == tableB
1 A FOO MED 13 tableA --> ???
3 B SAL MED 2 tableA --> [CNC]
3 B SAL LOW 10 tableA --> [CNC]
3 B SAL MED 8 tableB --> [CNC]

The first record matches, the second was only found in table A, the rest could not be compared.

*CNC means Could not compare. If there are more than one option for an index and no exact match, the tool will not try to use any logic to pick one to compare.

If there are field names that only exist in one input table these fields are not part of the comparison and are appended at the end after the Compare field.

To see details of the process, see process diagram here.

Use It:

pip install git+https://github.com/ryansutc/data_diff@master#egg=data_diff

~ Note to self..

To build as exe:

pyinstaller csv_diff.py -F --onedir

To setup dev env:

  • Make sure you have latest pipenv. pip install --upgrade pipenv
  • cd to root data_diff folder and build virtual env with packages specified in pipfile. pipenv install --dev
  • make sure IDE is using your virtual env for its interpretor. In PyCharm, open settings (ctl+alt+s), under "Project" select "Project Interpretor". Select virtual env & save as default for workspace.
  • activate virtual environment in terminal via pipenv shell
  • Always install additional packages w. pipenv install [package] or pipenv install [package] --dev.