Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VIVO Harvester #384

Merged

Conversation

alexgarciac
Copy link
Contributor

Work in progress for the VIVO harvester.

@fabianvf fabianvf changed the title WIP [WIP] VIVO Harvester Oct 1, 2015
@fabianvf
Copy link
Contributor

fabianvf commented Oct 1, 2015

Just to explain really quickly why the build failed, we have enforced style guidelines. If you click the Details button on the travis PR check, you will see the following output:

./scrapi/settings/local-dist.py:49:17: E126 continuation line over-indented for hanging indent

./scrapi/harvesters/vivo.py:5:68: W291 trailing whitespace

./scrapi/harvesters/vivo.py:15:1: F401 'parse' imported but unused

./scrapi/harvesters/vivo.py:18:1: F401 'HumanName' imported but unused

./scrapi/harvesters/vivo.py:22:1: F401 'requests' imported but unused

./scrapi/harvesters/vivo.py:26:1: F401 'compose' imported but unused

./scrapi/harvesters/vivo.py:29:1: W293 blank line contains whitespace

./scrapi/harvesters/vivo.py:30:1: E302 expected 2 blank lines, found 1

./scrapi/harvesters/vivo.py:46:1: E302 expected 2 blank lines, found 1

./scrapi/harvesters/vivo.py:52:1: E302 expected 2 blank lines, found 1

./scrapi/harvesters/vivo.py:69:73: W291 trailing whitespace

./scrapi/harvesters/vivo.py:71:26: W291 trailing whitespace

./scrapi/harvesters/vivo.py:74:122: W291 trailing whitespace

./scrapi/harvesters/vivo.py:84:74: W291 trailing whitespace

./scrapi/harvesters/vivo.py:92:97: W291 trailing whitespace

./scrapi/harvesters/vivo.py:95:27: W291 trailing whitespace

./scrapi/harvesters/vivo.py:97:59: W291 trailing whitespace

./scrapi/harvesters/vivo.py:113:40: W291 trailing whitespace

./scrapi/harvesters/vivo.py:117:40: W291 trailing whitespace

./scrapi/harvesters/vivo.py:121:40: W291 trailing whitespace

./scrapi/harvesters/vivo.py:130:39: W291 trailing whitespace

./scrapi/harvesters/vivo.py:148:38: W291 trailing whitespace

./scrapi/harvesters/vivo.py:157:122: W291 trailing whitespace

./scrapi/harvesters/vivo.py:171:124: E226 missing whitespace around arithmetic operator

./scrapi/harvesters/vivo.py:176:51: F821 undefined name 'process_sponsorships'

./scrapi/settings/local-dist.py:53:15: E126 continuation line over-indented for hanging indent

Obviously you don't really have to worry about this for now, but just making sure you know.

@fabianvf
Copy link
Contributor

fabianvf commented Oct 1, 2015

Oh, looks like you fixed it already. Nevermind then!

@fabianvf
Copy link
Contributor

fabianvf commented Oct 1, 2015

Also, you can run this locally with flake8 .

@alexgarciac
Copy link
Contributor Author

Thanks, there is still some issues to correct, I'll have this in mind to check my changes before pushing.

('journalTitle', '/journalTitle'),
('abstract', ('/abstract', lambda x: x if x else '')),
('issue', '/issue'),
('publisher', '/publisher'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

publisher can actually fit inside the SHARE schema, under the publisher field. It has to be either an organization or a person.

@fabianvf
Copy link
Contributor

fabianvf commented Oct 1, 2015

I left some comments about things that may be able to move to top level fields, I hope that is helpful!

Also, our schema is also on github (https://github.com/CenterForOpenScience/SHARE-schema), so if you have any suggestions for improving it, feel free to bring them up there

?URI bibo:pmid ?PMID .
}}
OPTIONAL {{
?autorship a vivo:Authorship .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is autorship the correct name, or should it be authorship? I am not familiar with the API, so I am not sure if this is a typo or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is a typo.

@fabianvf
Copy link
Contributor

fabianvf commented Oct 1, 2015

Also, I am fabianvf on IRC, you can usually find me in the #cos channel on freenode.

@alexgarciac
Copy link
Contributor Author

fabianvf VIVO has Academic Article
Article
Award or Honor
Blog Posting
Book
Case Study
Catalog
Chapter
Concept
Conference Paper
Conference Poster
Database
Edited Book
Editorial Article
Equipment
Extension Document
Human Study
Journal
Newsletter
News Release
Patent
Proceedings
Report
Research Proposal
Review
Series
Software
Thesis
Video
Webpage
Website

Alexander Garcia Castro [9:47 PM]
do u want to harvest everything ?

Alexander Garcia Castro [9:47 PM]
or do u want to limit the harvesting just to some specific types?

- Publisher in the right schema field
- Put out of ‘OtherProperties’ the fields that can be resolved as URIs,
the only fields that I haven’t been able to change is the ISSN and the
ISBN ‘cause I didn’t find any ISSN or ISBN resolver ¿do you know any?
- Put in Keywords in the tags field
Optimizing and splitting queries to avoid the use of OPTIONAL
statements in SPARQL.
@fabianvf
Copy link
Contributor

Looks like you need to add a requirement to the requirements.txt

@fabianvf
Copy link
Contributor

Also, sorry, looks like I missed your comment from earlier. I think nearly all of those are interesting. I would start just by harvesting them all, but provide a mechanism to limit it later if we decide something is irrelevant.

from SPARQLWrapper import SPARQLWrapper, JSON

from scrapi import settings
from scrapi.settings.sparql_mapping import *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to have import *, could you at least namespace the commands (from scrapi.settings import sparql_mapping as mapping or something like that)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, having looked at the settings file now, why not just import that variable?

from scrapi.settings.sparql_mapping import SPARQL_MAPPING

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

self.sparql_wrapper.setQuery(query_str)
results = self.sparql_wrapper.query()
results = results.convert()
for result in results['results']['bindings']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just

return [ result['uri']['value']  for result in results['results']['bindings'] ]

if you want, but this is really minor stylistic quibble, so feel free to ignore.

@fabianvf
Copy link
Contributor

Looks pretty good to me. Most comments are related to making the functions a little less dense/noisy, but didn't see any real logical issues. Once you get the build passing I can take another look and see what the data looks like when I run it locally. If you have any questions or disagreements, feel free to voice them! I am on the #cos IRC channel on freenode, but if there is another way you would like to chat, just let me know and I can get in there. Really appreciate this PR, I am very excited to get it in.

Removing patterns from the functions and ‘pythonizing’ some statements.
Adding the properties required by travis and the yaml file for the vivo
harvester.
Adding the icon for the vivo harvester.
The authors completion need to make some HTTP requests so the right
thing to do is include it in the harvest phase.
def schema(self):
return {
'title': ('/title', lambda x: x if x else ''),
'providerUpdatedDateTime': ('/date', lambda x: datetime.strptime(x, "%Y-%m-%d").strftime("%Y-%m-%dT%H:%M:%S%Z") + '+00:00'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, just noticed this. We actually have a helper function (helpers.datetime_formatter), which ensures that all datetimes we receive are processed into an identical format. could you use that instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok :)

@fabianvf fabianvf changed the title [WIP] VIVO Harvester VIVO Harvester Oct 19, 2015
@fabianvf
Copy link
Contributor

👍

fabianvf added a commit that referenced this pull request Oct 19, 2015
@fabianvf fabianvf merged commit 9423c0d into CenterForOpenScience:develop Oct 19, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants