Authors: Mr. Abhishek Anil Deshmukh, Dr. Subhankar Mishra

Self-citation is the unethical practice of authors citing their own papers to artificially inflate their citation count and, therefore, their citation score.

At the end of July 2020, Dr. Subhankar Mishra sent me a Nature article by Richard Van Noorden & Dalmeet Singh Chawla about extreme self-citing scientists. The study looked for academics who were inflating their citation counts by citing their own papers or by getting their co-authors to do the same.

After that, we started looking for data to find the “self-citation score” of researchers in general.
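
To make the goal concrete, here is a rough sketch of one way such a score could be defined: the fraction of a researcher's incoming citations that come from papers they themselves co-authored. Both this exact definition and the toy record format below are illustrative assumptions on my part, not the final metric of the study.

```python
# Illustrative sketch only: one plausible definition of a "self-citation score",
# i.e. the fraction of incoming citations that come from the author's own papers.
# The input format (a list of citing-paper author lists) is an assumption made
# for this example, not the schema we ended up using.

def self_citation_score(author, citing_author_lists):
    """Return the fraction of citing papers on which `author` also appears."""
    total = len(citing_author_lists)
    if total == 0:
        return 0.0
    self_cites = sum(author in authors for authors in citing_author_lists)
    return self_cites / total


# Toy example: 2 of the 4 papers citing A also list A as an author -> 0.5
citing = [["A", "B"], ["C"], ["A"], ["D", "E"]]
print(self_citation_score("A", citing))  # 0.5
```

Whatever the exact definition, computing it at scale requires, for every paper, both its own author list and the author lists of every paper citing it, which is why citation data became the bottleneck described below.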

We started with DBLP[1].

“The dblp computer science bibliography provides open bibliographic information on major computer science journals and proceedings. Originally created at the University of Trier in 1993, dblp is now operated and further developed by Schloss Dagstuhl.” - DBLP website

Fortunately, the dblp website provided a way to download their database; unfortunately, the download was a 2.8GB XML file. As cool as the dblp team is, they also provided a Java “library” to parse the XML for data analysis, but it was too slow for our analysis to run in a reasonable amount of time.

To overcome this problem, I wrote a Python script[2] to parse the XML and load the data into a Postgres database, which, once indexed, would be orders of magnitude faster to query than the XML. Before starting the analysis, to check that everything had been parsed correctly, we ran queries for some citation data we already knew and found that a substantial number of citations were missing. The dump had nearly all the publications, but their citation data, which was what this study was about, was largely absent. Later on, it turned out that the XML simply did not contain that data. A minimal sketch of the parsing approach is shown below.
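
For a rough idea of how such a script works, here is a minimal sketch of the streaming approach, assuming lxml and psycopg2. The table name, columns, and the subset of dblp record types handled are simplifications of mine; the actual script is the one linked in the references[2].

```python
# Minimal sketch of the streaming XML -> Postgres approach (not the full script[2]).
# The table layout and the record types handled here are simplifying assumptions.
import psycopg2
from lxml import etree

RECORD_TAGS = ("article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis", "www")

conn = psycopg2.connect(dbname="dblp", user="postgres")  # assumed credentials
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS publications (
        key   text PRIMARY KEY,
        title text,
        year  integer
    )
""")

# dblp.xml references character entities defined in dblp.dtd (assumed to sit next
# to the XML file), so the DTD must be loaded. iterparse streams the 2.8GB file
# instead of building one huge in-memory tree.
context = etree.iterparse("dblp.xml", events=("end",), tag=RECORD_TAGS,
                          load_dtd=True)

for _, elem in context:
    key = elem.get("key")
    title_el = elem.find("title")
    title = "".join(title_el.itertext()) if title_el is not None else None
    year = elem.findtext("year")
    cur.execute(
        "INSERT INTO publications (key, title, year) VALUES (%s, %s, %s) "
        "ON CONFLICT (key) DO NOTHING",
        (key, title, int(year) if year else None),
    )
    # Clear already-processed elements so memory use stays flat while streaming.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

conn.commit()
cur.close()
conn.close()
```

With the records in Postgres, spot-checking a handful of papers whose citations we already knew was a single query away, which is how the missing citation data showed up so quickly.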

We set out to find other sources of data, such as Orcid[3] and Scopus[4], but unfortunately they did not include the citation data we needed. Google Scholar[5] had the data but does not provide an API (between us, I tried scraping; spoiler alert, that did not work). Microsoft Academic's[6] API had time-based call restrictions which, if adhered to, would have made the analysis take an unreasonable amount of time. The table below summarizes why each source fell short.

| Source             | Why it didn't work?                                |
| ------------------ | -------------------------------------------------- |
| DBLP               | Lack of citation data                              |
| Orcid              | Citation data not available to the public          |
| Scopus             | Lack of citation data                              |
| Google Scholar     | No publicly available API                          |
| Microsoft Academic | API restrictions were too strict for the project   |

Conclusion

Now onto the real conclusion: there were no freely available sources of citation data that would allow a large-scale analysis. This is unfortunate, as self-citation is an unethical practice. Such an analysis could have revealed correlations between the environment or other factors and an increase in self-citation, which in turn would help us avoid creating environments or atmospheres that promote or enable self-citation.

References

  1. "dblp: computer science bibliography." https://dblp.org/. Accessed 25 Aug. 2020.
  2. "The parser script" https://github.com/smlab-niser/selfcitation. Accessed 25 Aug 2020.
  3. "ORCID." https://orcid.org/. Accessed 25 Aug. 2020.
  4. "Search for an author profile - Scopus." https://www.scopus.com. Accessed 25 Aug. 2020.
  5. "Google Scholar." https://scholar.google.com/. Accessed 25 Aug. 2020
  6. "Microsoft Academic - Microsoft Research." 22 Feb. 2016, https://www.microsoft.com/en-us/research/project/academic/. Accessed 25 Aug. 2020.