While we ingest the data from the API, we will apply some criteria. First, we will only ingest documents where the year is between 2016 and 2022. We want fairly recent language, as the terms and taxonomy of certain subjects can change over long periods of time. We will also add key terms and conduct multiple searches. While normally we would likely ingest random subject areas, we will use key terms to narrow our search. This way, we will have an idea of how many high-level topics we have, and can compare that to the output of the model.

Below, we create a function where we can add key terms and conduct searches through the API.

```python
import pandas as pd
import requests

def import_data(pages, start_year, end_year, search_terms):
    """
    This function uses the OpenAlex API to conduct a search on works
    and returns a dataframe with the associated works.

    Inputs:
    - pages: int, number of pages to loop through
    - search_terms: str, keywords to search for (must be formatted
      according to OpenAlex standards)
    - start_year and end_year: int, years to set as a range for filtering works
    """
    # create an empty dataframe
    search_results = pd.DataFrame()

    # loop through the requested number of pages (1 through `pages`)
    for page in range(1, pages + 1):
        # use parameters to conduct the request and format the response as a dataframe
        response = requests.get(
            f'https://api.openalex.org/works?page={page}'
            f'&per-page=200&filter=publication_year:{start_year}-{end_year},'
            f'type:article&search={search_terms}'
        )
        data = pd.DataFrame(response.json()['results'])

        # append to the running dataframe
        search_results = pd.concat([search_results, data])

    # subset to relevant features
    search_results = search_results[
        ["id", "title", "display_name", "publication_year", "publication_date",
         "type", "countries_distinct_count", "institutions_distinct_count",
         "has_fulltext", "cited_by_count", "keywords", "referenced_works_count",
         "abstract_inverted_index"]
    ]

    return search_results
```

We conduct 5 different searches, each covering a different technology area. These technology areas are inspired by the DoD "Critical Technology Areas".
See more here:

Here is an example of a search using the required OpenAlex syntax:

```python
# search for Trusted AI and Autonomy
ai_search = import_data(35, 2016, 2024, "'artificial intelligence' OR 'deep learn' OR 'neural net' OR 'autonomous' OR drone")
```

After compiling our searches and dropping duplicate documents, we must clean the data to prepare it for our topic model. There are 2 main issues with our current output:

1. The abstracts are returned as an inverted index (due to legal reasons). However, we can use these to reconstruct the original text.
2. Once we obtain the original text, it will be raw and unprocessed, creating noise and hurting our model. We will conduct traditional NLP preprocessing to get it ready for the model.

Below is a function to return the original text from an inverted index.

```python
def undo_inverted_index(inverted_index):
    """
    The purpose of this function is to 'undo' an inverted index.
    It inputs an inverted index and returns the original string.
    """
    # create empty lists to store the uninverted index
    word_index = []
    words_unindexed = []

    # loop through the index and collect key-value pairs
    for k, v in inverted_index.items():
        for index in v:
            word_index.append([k, index])

    # sort by the index
    word_index = sorted(word_index, key=lambda x: x[1])

    # join only the words and flatten
    for pair in word_index:
        words_unindexed.append(pair[0])
    words_unindexed = ' '.join(words_unindexed)

    return words_unindexed
```

Now that we have the raw text, we can conduct our traditional preprocessing steps, such as standardization, removing stop words, lemmatization, etc.
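Before moving on, it is worth sanity-checking the inversion on a toy example. The inverted index below is made up for illustration, and the snippet mirrors the logic of `undo_inverted_index`:

```python
# a made-up inverted index: word -> positions in the original string
toy_index = {"the": [0, 3], "model": [1, 4], "learns": [2]}

# same logic as undo_inverted_index: expand to (word, position)
# pairs, sort by position, and join the words back together
pairs = sorted(
    [(word, pos) for word, positions in toy_index.items() for pos in positions],
    key=lambda p: p[1],
)
original = ' '.join(word for word, _ in pairs)
print(original)  # the model learns the model
```

Note that words appearing multiple times (like "the" and "model" here) are restored at every position they occupy, which is why the index maps each word to a list of positions.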
Below are functions that can be mapped to a list or series of documents.

```python
import re
import nltk

def preprocess(text):
    """
    This function takes in a string, converts it to lowercase,
    cleans it (removing special characters and numbers), and tokenizes it.
    """
    # convert to lowercase
    text = text.lower()

    # remove special characters and digits
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)

    # tokenize
    tokens = nltk.word_tokenize(text)

    return tokens

def remove_stopwords(tokens):
    """
    This function takes in a list of tokens (from the 'preprocess' function)
    and removes a list of stopwords. Custom stopwords can be added to the
    'custom_stopwords' list.
    """
    # set default and custom stopwords
    stop_words = nltk.corpus.stopwords.words('english')
    custom_stopwords = []
    stop_words.extend(custom_stopwords)

    # filter out stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return filtered_tokens

def lemmatize(tokens):
    """
    This function conducts lemmatization on a list of tokens
    (from the 'remove_stopwords' function). This shortens each word
    down to its root form to improve modeling results.
    """
    # initialize the lemmatizer and lemmatize
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return lemmatized_tokens

def clean_text(text):
    """
    This function uses the previously defined functions to take a string
    and run it through the entire preprocessing pipeline.
    """
    # clean, tokenize, and lemmatize a string
    tokens = preprocess(text)
    filtered_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize(filtered_tokens)
    clean_text = ' '.join(lemmatized_tokens)

    return clean_text
```

Now that we have a preprocessed series of documents, we can create our first topic model!