Skip to content
This repository has been archived by the owner on Feb 28, 2018. It is now read-only.

Importing and serializing company data #263

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

Ilozuluchris
Copy link

This is a WIP but I want to invite @cuducos to review what I have done so far #211

@cuducos cuducos changed the base branch from cuducos-rows to master October 7, 2017 13:50
Copy link
Collaborator

@cuducos cuducos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Ilozuluchris! One more many thanks for the WIP PR. First of all I changed the merge branch to master, otherwise it would be impossible to browser the changes and picking up only what you've actually changed and what was merged since I created cuducos-rows branch.

That said you're in the right path. The main point is that rows powerful import tool can handle all serialization so we can get rid of that in Jarbas source code. So there is still a few extras steps (mostly related to the force_types kwarg).

In addition, there are some Python style issues that are indeed minor stuff. Just run prospector before you declare you code ready and probably you can spot them all ; )

import lzma
import rows
import rows.fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need to import rows and then rows.fields?

@@ -23,6 +24,8 @@ def handle(self, *args, **options):

self.save_companies()



Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need these extra spaces: according to PEP8 1 line is fine to separate class methods.

return row._asdict()



Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need these extra spaces: according to PEP8 1 line is fine to separate class methods.



class InputDateField(rows.fields.DateField):
INPUT_FORMAT = '%d/%m/%Y'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd declare this class in the beginning of the file so when you use it in the other class the reader is aware about what this field is ; )

@@ -45,6 +48,12 @@ def save_companies(self):
self.count += 1
self.print_count(Company, count=self.count)

@staticmethod
def transform(row):
return row._asdict()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a method for that… calling _asdict() in the for loop seams enough.

@@ -63,17 +72,15 @@ def save_activities(self, row):

return [main], secondaries


def serialize(self, row):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically the great point of using rows is to get rid of the serialize method. The idea is to use the force_types keyword argument when calling rows.import_from_csv and let rows serialize stuff as we need it (for example, #211).

Copy link
Collaborator

@cuducos cuducos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to add that comment on the previous review, sorry.

@@ -31,7 +34,7 @@ def save_companies(self):
skip = ('main_activity', 'secondary_activity')
keys = tuple(f.name for f in Company._meta.fields if f not in skip)
with lzma.open(self.path, mode='rt', encoding='utf-8') as file_handler:
for row in csv.DictReader(file_handler):
for row in map(self.transform,rows.import_from_csv(filename_or_fobj=file_handler)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might work but this is not pythonic. First of all, style:

  1. try to keep things under 80 chars
  2. list comprehension usually are more readable than map - Zen of Pytthon
  3. always add a space after come when separating values (in sequences) and arguments (in function calls) — also PEP8

I'd try something like:

for row in rows.import_from_csv(file_handler):
    row = row._asdict()  # and we can get rid of the self.transform call
    …

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay @cuducos Thanks for the review,I would make the necessary changes .

Copy link
Author

@Ilozuluchris Ilozuluchris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cuducos rows.fields.EmailField raises a value error when it tries to serialize 'ahoy'

@cuducos
Copy link
Collaborator

cuducos commented Oct 9, 2017

rows.fields.EmailField raises a value error when it tries to serialize 'ahoy'

IMHO that is expected as 'ahoy' is not a valid email address… right?

@Ilozuluchris
Copy link
Author

Ilozuluchris commented Oct 9, 2017

Yeah,it is...the serialize method returned invalid email addresses as none.I just mentioned it since that was not how the 'serialize' method handled invalid email addresses.
Do you have any corrections to make?

@Ilozuluchris
Copy link
Author

Ilozuluchris commented Oct 14, 2017

@cuducos I am confused,I do not know how to go about this https://github.com/datasciencebr/jarbas/blob/f95addd5ea4c47f161796cf1e72a58050f66ce60/jarbas/core/management/commands/suspicions.py#L50 since the serialize method actually reshapes rows

@cuducos
Copy link
Collaborator

cuducos commented Oct 15, 2017

@cuducos I am confused,I do not know how to go about this

Would you mind expanding on your doubt? I mean… you just highlighted the method we're trying to get rid of, so I don't know exactly where you doubt lies…

@Ilozuluchris
Copy link
Author

    reserved_keys = (
        'applicant_id',
        'document_id',
        'probability',
        'year'
    )
    hypothesis = tuple(k for k in row.keys() if k not in reserved_keys)
    pairs = ((k, v) for k, v in row.items() if k in hypothesis)
    filtered = filter(lambda x: self.bool(x[1]), pairs)
    suspicions = {k: True for k, _ in filtered} or None

    return dict(
        document_id=document_id,
        probability=probability,
        suspicions=suspicions
    )

I was talking about this arrangement,am not sure how to go about this with rows

@cuducos
Copy link
Collaborator

cuducos commented Oct 15, 2017

Ow… you're right! I don't know any thing by heart, maybe here we do need a serialization method!… @turicas, any idea on that?

Our input is a CSV like:

document_id, hypothesis_1, hypotehesis_2, hypothesis_3
42,True,False,True

And the expected output is something like:

[
    {
        'document_id': 42,
        'suspicions': {
            'hypothesis_1': True,
            'hypothesis_3': True
        }
    }
]

@Ilozuluchris
Copy link
Author

Hello @cuducos, anything left wrt issue #211

@cuducos
Copy link
Collaborator

cuducos commented Dec 20, 2017

Hello @cuducos, anything left wrt issue #211

Yep, tests are failing, so there is still something to fix. You might consider adding rows to the requirements.txt; )

@Ilozuluchris
Copy link
Author

Okay I will, thanks.

@Ilozuluchris
Copy link
Author

@cuducos The travis CI job does not run, the requests page reports this

GitHub payload is missing a merge commit (mergeable_state: "unknown", merged: false)

I believe it has to do with the merge conflict wrt requirements.txt. I tried to resolve myself but that did not work as expected.

@cuducos
Copy link
Collaborator

cuducos commented Dec 24, 2017

The travis CI job does not run

After fixing conflicts it is still red:

======================================================================
ERROR: test_save_companies (jarbas.core.tests.test_companies_command.TestCreate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/python/3.5.4/lib/python3.5/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/opt/python/3.5.4/lib/python3.5/unittest/case.py", line 605, in run
    testMethod()
  File "/opt/python/3.5.4/lib/python3.5/unittest/mock.py", line 1159, in patched
    return func(*args, **keywargs)
TypeError: test_save_companies() missing 1 required positional argument: 'lzma'

@cuducos
Copy link
Collaborator

cuducos commented Feb 23, 2018

@Ilozuluchris, I'm not sure why the CodeClimate checks are missing. Gonna fix that and then we come back to this PR, ok?

@Ilozuluchris
Copy link
Author

@cuducos Okay no problem, I am sorry it took so long to do this.

Copy link
Collaborator

@cuducos cuducos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Ilozuluchris, many thanks for pushing this PR further. I'm just afraid CodeClimate could help you with code style. Would you mind pushing a new commit so the CodeClimate check runs? (I just fixed it).

phone=b'', responsible_federative_entity=b'', situation=b'', zip_code=b'',
situation_date=datetime.date(2005, 9, 24), situation_reason=b'', type='Book',
special_situation_date=None, state=b'', status=b'', trade_name=b'',
special_situation=b'')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about creating a dictionary and unpacking it to enhance the readability here?

kwargs = {
    'additional_address_details': b'',
    'address': b'',
    …
}
create.assert_called_once_with(**kwargs)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will.

@Ilozuluchris
Copy link
Author

@cuducos All checks passed.

Copy link
Collaborator

@cuducos cuducos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, just some formatting issues (don'tknow why) Code Climate skips.

'special_situation_date': CompaniesDate,
'latitude': rows.fields.FloatField,
'longitude': rows.fields.FloatField
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a contant, could you named it with caps (i.e. COMPANIES_CSV_FIELDS) and put it before the class definition (just after the imports)?

Also, in Python we use one of these formatting styles:

COMPANIES_CSV_FIELDS = {
    'email': rows.fields.EmailField,
    'opening': CompaniesDate,
    'situation_date': CompaniesDate,
    'special_situation_date': CompaniesDate,
    'latitude': rows.fields.FloatField,
    'longitude': rows.fields.FloatField
}

Or:

COMPANIES_CSV_FIELDS = {'email': rows.fields.EmailField,
                        'opening': CompaniesDate,
                        'situation_date': CompaniesDate,
                        'special_situation_date': CompaniesDate,
                        'latitude': rows.fields.FloatField,
                        'longitude': rows.fields.FloatField}

Also, can you reflect this formatting decision in arbas/core/tests/test_companies_command.py (line 46), isting one field per line?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants