Keywords are descriptive words or phrases that characterize your documents. For example, keywords from this article would be tf-idf, scikit-learn, keyword extraction, and so on. These keywords are also referred to as topics in some applications.

In this era of "use Deep Learning for everything," one may wonder why you would even use TF-IDF for any task at all. The truth is that TF-IDF is easy to understand, easy to compute, and is one of the most versatile statistics for showing the relative importance of a word or phrase in a document, or in a set of documents, compared with the rest of your corpus. TF-IDF can be used for a wide range of tasks including text classification, clustering / topic modeling, search, keyword extraction, and a whole lot more.

In this article, you will learn how to perform keyword extraction using Python, specifically using TF-IDF from the scikit-learn package to extract keywords from documents.

Let's get started with Python keyword extraction

I'm assuming that folks following this tutorial are already familiar with the concept of TF-IDF. If you are not, please familiarize yourself with the concept before reading on. There are a couple of videos online that give an intuitive explanation of what it is; for a more academic explanation, I would recommend my Ph.D. advisor's explanation.

In this keyword extraction tutorial, we'll be using a Stack Overflow dataset, which is a bit noisy and simulates what you could be dealing with in real life. We will be using two files: stackoverflow-data-idf.json has 20,000 posts and is used to compute the Inverse Document Frequency (IDF), and stackoverflow-test.json has 500 posts that we will use as a test set to extract keywords from. This dataset is based on the publicly available Stack Overflow dump from Google's BigQuery.
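To make the intuition concrete before we touch the real data, here is a minimal sketch on a made-up three-sentence corpus (the sentences are purely illustrative, not from our dataset): a term confined to a single document, like "scikit", receives a higher TF-IDF weight than a term like "cat" that is spread across several documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# a tiny made-up corpus: "cat" appears in two documents,
# "scikit" in only one
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "scikit-learn makes tf-idf easy",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)  # rows = documents, columns = vocabulary terms

vocab = tfidf.vocabulary_   # maps each term to its column index
scores = matrix.toarray()

# "scikit" is rarer across the corpus, so its weight in document 2
# is higher than the weight of "cat" in document 0
print(scores[2][vocab["scikit"]], scores[0][vocab["cat"]])
```

This is the whole idea in miniature: frequency within a document, discounted by frequency across documents.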
The first thing we'll do is take a peek at our dataset. The code below reads a one-per-line JSON string from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and the total number of posts. Here, lines=True simply means we are treating each line in the text file as a separate JSON string.

```python
import pandas as pd

df_idf = pd.read_json("data/stackoverflow-data-idf.json", lines=True)

print("Schema:\n\n", df_idf.dtypes)
print("Number of questions,columns=", df_idf.shape)
```
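To see what lines=True is doing, here is a minimal sketch on a made-up two-line file (the field names below are illustrative, not the actual 19-field Stack Overflow schema): each line is a complete, standalone JSON object, and pandas turns each one into a row.

```python
import io

import pandas as pd

# two JSON objects, one per line -- the "JSON Lines" format
jsonl = io.StringIO(
    '{"title": "How do I parse JSON?", "body": "..."}\n'
    '{"title": "Pandas merge question", "body": "..."}\n'
)

df = pd.read_json(jsonl, lines=True)
print(df.shape)  # one row per line, one column per JSON key
```

Without lines=True, pandas would instead try to parse the whole file as a single JSON document and fail on this input.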
Notice that this Stack Overflow dataset contains 19 fields including post title, body, tags, dates, and other metadata which we don't quite need for this tutorial. What we are mostly interested in is the body and title, which will become our source of text for keyword extraction.

We will now create a field that combines both body and title so we have the two in one field. We will also print the second text entry in our new field just to see what the text looks like.

```python
df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(lambda x: pre_process(x))
```

Hmmm, this doesn't look very readable, does it? Well, that's because we are cleaning the text after we concatenated the two fields. All of the cleaning happens in pre_process(.). You can do a lot more in pre_process(.), such as eliminating all code sections or normalizing the words to their roots, but for simplicity we perform only some mild pre-processing.

Creating Vocabulary and Word Counts for IDF

We now need to create the vocabulary and start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our df_idf, followed by the counts of words in the vocabulary (see: usage examples for CountVectorizer).

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_stop_words(stop_file_path):
    """Load stop words from a file, one word per line."""
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return stop_set

# load a set of stop words
stopwords = get_stop_words("resources/stopwords.txt")

# create a vocabulary of words;
# ignore words that appear in 85% of documents,
# eliminate stop words
cv = CountVectorizer(max_df=0.85, stop_words=stopwords)
word_count_vector = cv.fit_transform(df_idf['text'])
```

While cv.fit(.) would only create the vocabulary, cv.fit_transform(.) creates the vocabulary and returns a term-document matrix, which is what we want. In this matrix, each row represents a document in our dataset, each column represents a word in the vocabulary, and the values are the word counts.
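Here is a minimal sketch of that term-document matrix on a made-up three-document corpus (the documents are purely illustrative; the max_df=0.85 setting mirrors the one above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# three toy "posts"
docs = [
    "python keyword extraction with tfidf",
    "python pandas dataframe question",
    "java string question",
]

cv = CountVectorizer(max_df=0.85)
word_count_vector = cv.fit_transform(docs)

# rows = documents, columns = vocabulary terms, values = raw counts
print(word_count_vector.shape)
print(sorted(cv.vocabulary_))
```

With only three documents, max_df=0.85 would drop a term appearing in all three; here no term does, so the full vocabulary survives and each document becomes a sparse row of counts.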