I am trying to harvest scientific-publication data from different online sources such as Core, PMC, and arXiv. From these sources I keep each article's metadata (title, authors, abstract, etc.) and the full text (only from the sources that provide it).
However, I don't want to harvest the same article's data from multiple sources. That is, I want a mechanism that tells me whether an article I am about to harvest is already present in the dataset of articles I have harvested.
The first thing I tried was to check whether the article I want to harvest has a DOI, and then search the collection of metadata I have already harvested for that DOI. If it is found there, the article has already been harvested. This approach, however, is very time-consuming, since it requires a serial search through the metadata of ~10 million articles (in XML format), and it gets much slower still for articles without a DOI, where I would have to compare other metadata fields (such as title, authors, and publication date) instead.
Here is my current implementation:

    from os import listdir
    import xml.etree.ElementTree as ET

    def core_pmc_sim(core_article):
        if core_article.doi is not None:  # if the core article has a doi
            for xml_file in listdir('path_of_the_metadata_files'):  # parse all PMC xml metadata files
                # iterate through every tag in the xml
                for event, elem in ET.iterparse('path_of_the_metadata_files' + xml_file):
                    if elem.tag == 'hasDOI':
                        print(xml_file, elem.text, core_article.doi)
                        if elem.text == core_article.doi:  # if the PMC doi equals the core doi, the articles are the same
                            return True
                    elem.clear()
        return False
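One way to avoid re-scanning every XML file for each lookup would be to parse the whole metadata directory once, collect every DOI into an in-memory set, and then answer each "already harvested?" question with an O(1) membership test. A sketch under that assumption (the directory path, the `hasDOI` tag name, and the helper names are illustrative, taken from or modeled on the snippet above):

```python
import os
import xml.etree.ElementTree as ET


def build_doi_index(metadata_dir):
    """Single pass over all metadata files; returns a set of normalized DOIs."""
    dois = set()
    for xml_file in os.listdir(metadata_dir):
        for event, elem in ET.iterparse(os.path.join(metadata_dir, xml_file)):
            if elem.tag == 'hasDOI' and elem.text:
                dois.add(elem.text.strip().lower())  # normalize case/whitespace
            elem.clear()  # free memory as we stream through the file
    return dois


def already_harvested(doi, doi_index):
    """O(1) membership test against the prebuilt index."""
    return doi is not None and doi.strip().lower() in doi_index
```

Building the index costs one full scan, but every subsequent lookup is a hash-set probe instead of a pass over ~10 million XML files; ~10 million DOI strings typically fit in a few hundred MB of RAM, and the set could also be pickled to disk and reloaded between harvesting runs.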
What is the most rapid and memory-efficient way to achieve this?
(Would a Bloom filter be a good approach to this problem?)
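For reference, a Bloom filter answers "definitely not seen before" in constant time using only a few bits per DOI; a positive answer is only "probably seen" and needs confirmation against the real metadata. A minimal hand-rolled sketch (the bit-array size and hash count below are illustrative, not tuned):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, item):
        # Double hashing: derive k bit positions from one MD5 digest.
        digest = hashlib.md5(item.encode('utf-8')).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:], 'big')
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Because false positives are possible, a hit should still be confirmed against the authoritative metadata before skipping an article, while a miss is definitive and the article can be harvested immediately. A Bloom filter sized for ~10 million DOIs at a 1% false-positive rate needs only on the order of 10 MB, far less than a full set of DOI strings.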