
How to check rapidly if an element is present in a large set of data

I am trying to harvest scientific publication data from different online sources like Core, PMC, arXiv, etc. From these sources I keep the metadata of the articles (title, authors, abstract, etc.) and the full text (only from the sources that provide it).

However, I don't want to harvest the same article's data from different sources. That is, I want a mechanism that tells me whether an article I am about to harvest is already present in the dataset of articles I have harvested.

The first thing I tried was to check whether the article I want to harvest has a DOI and, if so, to search the already-harvested metadata collection for that DOI. If it is found there, the article has already been harvested. This approach, though, is very time-consuming, because it requires a serial search through a collection of ~10 million article metadata records (in XML format). The time would increase much more for articles that don't have a DOI, where I would have to compare other metadata (like title, authors and date of publication).

import os
import xml.etree.ElementTree as ET

def core_pmc_sim(core_article):
    metadata_dir = 'path_of_the_metadata_files'
    if core_article.doi is not None:  # if the core article has a doi
        for xml_file in os.listdir(metadata_dir):  # parse all PMC xml metadata files
            # iterate through every tag in the xml
            for event, elem in ET.iterparse(os.path.join(metadata_dir, xml_file)):
                if elem.tag == 'hasDOI':
                    print(xml_file, elem.text, core_article.doi)
                    # if PMC doi is equal to the core doi then the articles are the same
                    if elem.text == core_article.doi:
                        return True
                elem.clear()  # free the element to keep memory use low
    return False
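
For comparison, one possible alternative to rescanning every XML file per lookup is to read the harvested metadata once and keep only a small key per article in an in-memory set, so each later check is a constant-time lookup. The sketch below is a minimal illustration, not a definitive implementation. The names `normalize_doi`, `fallback_key` and `SeenArticles` are hypothetical. It assumes that all keys fit in RAM and that a normalized title plus publication year is an acceptable stand-in for articles without a DOI (authors are left out because name formats tend to differ between sources).

```python
import hashlib
import unicodedata

# Common ways a DOI is written; all of them should compare equal.
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi):
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def fallback_key(title, year):
    """Crude key for articles without a DOI (assumption: title + year is enough)."""
    text = unicodedata.normalize("NFKD", title).casefold()
    # Keep letters and digits only, so punctuation and spacing differences vanish.
    compact = "".join(ch for ch in text if ch.isalnum())
    digest = hashlib.sha1(f"{compact}|{year}".encode("utf-8")).hexdigest()
    return "nodoi:" + digest


class SeenArticles:
    """Set of keys for already-harvested articles; membership tests are O(1) on average."""

    def __init__(self):
        self._keys = set()

    @staticmethod
    def _key(doi=None, title="", year=None):
        if doi:
            return "doi:" + normalize_doi(doi)
        return fallback_key(title, year)

    def add(self, doi=None, title="", year=None):
        self._keys.add(self._key(doi, title, year))

    def contains(self, doi=None, title="", year=None):
        return self._key(doi, title, year) in self._keys


# Usage: fill the index once while parsing the harvested metadata, then query it.
seen = SeenArticles()
seen.add(doi="https://doi.org/10.1000/XYZ123")
seen.add(title="Deep Learning: A Review", year=2015)
print(seen.contains(doi="10.1000/xyz123"))
print(seen.contains(title="deep learning, a review", year=2015))
```

The set could be rebuilt at startup from the XML files or persisted between runs. If it does not fit in memory, an indexed column in an on-disk database would play the same role.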

What is the most rapid and memory-efficient way to achieve this?

(Would a Bloom filter be a good approach for this problem?)
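
To make the Bloom filter idea concrete, here is a minimal pure-Python sketch, offered as an illustration under stated assumptions rather than a recommended implementation. The class name `BloomFilter` and its parameters are hypothetical. It uses the standard sizing formulas (bits = -n ln p / (ln 2)^2, hashes = (bits / n) ln 2) and derives the bit positions from a single BLAKE2b digest by double hashing. A Bloom filter never reports a stored item as missing, but it can report an unseen item as present, so a positive answer would still need to be confirmed against an exact store.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        n, p = expected_items, false_positive_rate
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: split one 128-bit digest into two 64-bit values.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # odd, so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


# Usage with made-up DOIs: every added item is always reported as present.
harvested = BloomFilter(expected_items=1000, false_positive_rate=0.01)
harvested.add("10.1000/xyz123")
print("10.1000/xyz123" in harvested)   # True: added items are never missed
print("10.1000/not-seen" in harvested) # usually False, occasionally a false positive
```

With these formulas, 10 million keys at a 1% false-positive rate need about 96 million bits, roughly 12 MB. The filter is therefore most attractive when an exact set of keys would not fit in memory.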
