I am trying to harvest scientific-publication data from different online sources such as Core, PMC, and arXiv. From these sources I keep each article's metadata (title, authors, abstract, etc.) and the full text (only from the sources that provide it).
However, I don't want to harvest the same article's data from multiple sources. That is, I want a mechanism that tells me whether an article I am about to harvest is already present in the dataset of articles I have harvested.
The first thing I tried was to check whether the article I want to harvest has a DOI, and then search the collection of metadata I have already harvested for that DOI. If it is found there, the article has already been harvested. This approach, however, is very time-consuming, since it requires a serial search through the metadata of ~10 million articles (in XML format), and it gets much slower still for articles without a DOI, where I would have to compare other metadata fields (such as title, authors, and publication date) instead.
Here is my current implementation:

    from os import listdir
    import xml.etree.ElementTree as ET

    def core_pmc_sim(core_article):
        if core_article.doi is not None:  # if the core article has a doi
            for xml_file in listdir('path_of_the_metadata_files'):  # parse all PMC xml metadata files
                # iterate through every tag in the xml
                for event, elem in ET.iterparse('path_of_the_metadata_files' + xml_file):
                    if elem.tag == 'hasDOI':
                        print(xml_file, elem.text, core_article.doi)
                        if elem.text == core_article.doi:  # if the PMC doi equals the core doi, the articles are the same
                            return True
                    elem.clear()
        return False
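One way to avoid re-scanning every XML file for each lookup would be to parse the whole metadata directory once, collect every DOI into an in-memory set, and then answer each "already harvested?" question with an O(1) membership test. A sketch under that assumption (the directory path, the `hasDOI` tag name, and the helper names are illustrative, taken from or modeled on the snippet above):

```python
import os
import xml.etree.ElementTree as ET


def build_doi_index(metadata_dir):
    """Single pass over all metadata files; returns a set of normalized DOIs."""
    dois = set()
    for xml_file in os.listdir(metadata_dir):
        for event, elem in ET.iterparse(os.path.join(metadata_dir, xml_file)):
            if elem.tag == 'hasDOI' and elem.text:
                dois.add(elem.text.strip().lower())  # normalize case/whitespace
            elem.clear()  # free memory as we stream through the file
    return dois


def already_harvested(doi, doi_index):
    """O(1) membership test against the prebuilt index."""
    return doi is not None and doi.strip().lower() in doi_index
```

Building the index costs one full scan, but every subsequent lookup is a hash-set probe instead of a pass over ~10 million XML files; ~10 million DOI strings typically fit in a few hundred MB of RAM, and the set could also be pickled to disk and reloaded between harvesting runs.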
What is the most rapid and memory-efficient way to achieve this?
(Would a Bloom filter be a good approach to this problem?)
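For reference, a Bloom filter answers "definitely not seen before" in constant time using only a few bits per DOI; a positive answer is only "probably seen" and needs confirmation against the real metadata. A minimal hand-rolled sketch (the bit-array size and hash count below are illustrative, not tuned):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, item):
        # Double hashing: derive k bit positions from one MD5 digest.
        digest = hashlib.md5(item.encode('utf-8')).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:], 'big')
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Because false positives are possible, a hit should still be confirmed against the authoritative metadata before skipping an article, while a miss is definitive and the article can be harvested immediately. A Bloom filter sized for ~10 million DOIs at a 1% false-positive rate needs only on the order of 10 MB, far less than a full set of DOI strings.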