A quick approach for highlighting keywords of interest within a PDF document and calculating their frequencies.
With the amount of available information increasing every day, the ability to quickly gather relevant statistics about that information is valuable for relationship mapping and for gaining a new perspective on otherwise redundant data. Today we will look at text extraction from PDFs, a first step toward information extraction, and a quick approach to formulating some facts and ideas about different corpora. This article dives into the field of Natural Language Processing (NLP), which concerns a computer’s ability to comprehend human language.
Information Extraction (IE), as defined by Jurafsky et al., is the “process for turning unstructured information embedded in texts into structured data” [1]. A very quick form of information extraction is not only to search for whether a word appears within a body of text, but also to calculate how many times that word is mentioned. This rests on the assumption that the more often a word is mentioned within a body of text, the more important it is to the corpus’s theme. Note that stopword removal is essential for this process. Why? If you simply counted all of the word frequencies within a corpus, the word “the” would appear a lot. Does that make it important in terms of conveying what information is within the text? No, and therefore you want to ensure you are looking at frequencies of words that contribute to the semantic meaning of your corpora.
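As a minimal sketch of this idea, the snippet below tokenizes a text, drops stopwords, and counts the remaining words. The tiny stopword set here is purely illustrative; a real project would use a fuller list, for example from NLTK or spaCy.

```python
import re
from collections import Counter

# Illustrative stopword set only; swap in a full list (e.g. NLTK's) in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def keyword_frequencies(text: str) -> Counter:
    """Lowercase the text, tokenize on alphabetic runs, drop stopwords,
    and count how often each remaining word appears."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

freqs = keyword_frequencies(
    "The extraction of text and the analysis of text frequencies."
)
print(freqs.most_common(2))  # [('text', 2), ('extraction', 1)]
```

With stopwords removed, “text” surfaces as the most frequent content word, which matches the intuition above that frequency signals topical importance.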
IE can lead to other NLP techniques being used on a document. These techniques go beyond the code in this article, but I felt they were both interesting and important to share.
The first technique is Named Entity Recognition (NER). As detailed by Jurafsky et al., “The task of named entity recognition (NER) is to find each mention of a named entity in the text and label its type.” [1] This is similar to the idea of searching for the…