Natural Language Processing Notes – Class 10 AI (417)
Master Class 10 AI Natural Language Processing!
Simplified notes on NLP Applications, Chatbots, Text Processing (Sentence Segmentation, Tokenization, Stemming, Lemmatization), Bag of Words and TFIDF. Includes diagrams for easy and quick exam revision. These Natural Language Processing notes are prepared as per the latest CBSE curriculum.
Natural Language Processing
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables computers to understand, interpret, and process human language. It helps machines analyze text or speech and extract meaningful information, allowing them to communicate with humans more effectively.
Why is NLP important?
- It enables communication between humans and computer systems.
- It helps computers understand intent and context in language.
- It supports the development of tools and techniques for better interaction with machines.
Applications of Natural Language Processing (NLP)
- Autogenerated Captions: Converts speech into text in real time, improving accessibility of videos.
  Example: YouTube captions, Google Meet subtitles
- Voice Assistants: Understand and process spoken language to perform tasks.
  Example: Google Assistant, Alexa, Siri
- Language Translation: Translates text or speech from one language to another, enabling global communication.
  Example: Google Translate
- Sentiment Analysis: Identifies whether a text expresses positive, negative, or neutral feelings.
  Use: Understanding customer opinions and feedback
- Keyword Extraction: Automatically finds important words or phrases from text.
  Use: Analyzing trends and improving customer service
- Text Classification: Categorizes text into predefined groups.
  Example: News articles classified as Sports, Food, Politics
Stages of Natural Language Processing (NLP)
The stages of Natural Language Processing (NLP) typically involve the following:
- Lexical Analysis
- Syntactic Analysis
- Semantic Analysis
- Discourse Integration
- Pragmatic Analysis
Lexical Analysis
- It’s the process of breaking down the input text into tokens such as words, sentences, or structured paragraphs.
- Lexicon stands for a collection of various words and phrases used in a language.
Syntactic Analysis
- It’s the process of checking grammar of sentences and phrases.
- It forms relations among words and eliminates logically incorrect sentences.
Semantic Analysis
- In this stage, input text is checked for meaning, and every word and phrase is checked for meaningfulness.
Discourse Integration
- It refers to how sentences in a conversation or text connect to create meaningful and clear communication.
- It ensures that each sentence is logically related to the sentences before and after it.
Pragmatic Analysis
- It focuses on understanding the intended meaning of a sentence based on context, tone, and real-world knowledge rather than just its literal meaning.
Chatbots
- One of the most common applications of Natural Language Processing (NLP) is a chatbot.
- A chatbot is a computer program designed to simulate human conversation through text or voice.
- It can interact with users, understand their queries, and respond accordingly.
- Over time, chatbots can learn and improve their interactions.
- Chatbots are widely used to answer questions, solve customer problems, provide support, generate leads, and boost sales on e-commerce platforms.
- They help businesses communicate with users quickly and efficiently.
- Some popular chatbots are: Eliza, Mitsuku, Cleverbot, Singtel
Types of Chatbots
- Script-bot
- Smart-bot
| Script Bot | Smart Bot |
| Easy to make | Flexible and powerful |
| Works around a pre-defined script programmed in them | Works on bigger databases and other resources directly |
| Mostly free and easy to integrate with messaging platforms | Learns with more data |
| No or little language processing skills | Requires coding to develop |
| Limited functionality | Wide functionality |
Text Processing
Text processing is the process in NLP where raw human language is cleaned, simplified, and prepared so that computers can understand and work with it effectively.
Steps of Text Processing:
- Text Normalisation:
The raw text is cleaned and simplified, for example by removing stop words, special characters, and punctuation and converting the text to a common case.
- Bag of Words:
After cleaning, the text is converted into numerical form by representing it as words and their frequency, so that computers can analyze it easily.
Text Normalization
In text normalization, the text is cleaned and simplified by removing unnecessary elements like punctuation, symbols, and converting text into a standard form.
In this process, we often work with a collection of text from multiple documents, which is called a corpus. All the text from these documents together forms the corpus on which text normalisation techniques are applied.
Steps in Text Normalization
- Sentence Segmentation
- Tokenization
- Removing Stop words, Special Characters, Numbers
- Converting Text to Common Case
- Stemming
- Lemmatization
Sentence Segmentation
In sentence segmentation, the entire corpus is divided into sentences.
A collection of textual data from multiple documents put together is called a corpus.
| Natural Language Processing is interesting. It helps computers understand human language. It is widely used in chatbots. |
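The segmentation step above can be sketched in Python. This is a minimal illustration using a simple punctuation rule; real NLP libraries use far more careful rules (abbreviations, decimals, etc.), so treat this as a teaching sketch only.

```python
import re

corpus = ("Natural Language Processing is interesting. "
          "It helps computers understand human language. "
          "It is widely used in chatbots.")

# Split the corpus after sentence-ending punctuation followed by whitespace
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]

for s in sentences:
    print(s)
```

Running this prints the three sentences of the corpus, one per line.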
Tokenization
- After segmenting the sentences, each sentence is then further divided into tokens.
- A token is the term used for any word, number, or special character occurring in a sentence.
| Natural Language Processing is interesting. It helps computers understand human language. |
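Tokenization can likewise be sketched with a simple pattern: word characters form one token, and each punctuation mark becomes its own token. This is only an assumed minimal rule, not how production tokenizers work.

```python
import re

sentence = "Natural Language Processing is interesting."

# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)
```

Here the full stop is kept as a separate token, so the sentence yields six tokens.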
Removing Stop words, Special Characters and Numbers
Stop words (such as articles, pronouns, and prepositions) are removed from the token list because they carry little meaningful information and are not important for analysis.
Special characters and numbers may also be removed if they are not useful, but they can be kept when they carry important information (like in email IDs).
Converting Text to a Common Case
After removing stop words, the text is converted into a single case (usually lowercase). This prevents the computer from treating the same words as different just because of capital letters.
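The two steps above (removing stop words, then lowercasing) can be combined in a few lines. The stop-word list below is a tiny illustrative set chosen for this example; real stop-word lists contain over a hundred entries.

```python
tokens = ["Natural", "Language", "Processing", "is", "interesting"]

# A tiny illustrative stop-word set; real lists are much longer
stop_words = {"is", "a", "an", "the", "and", "to", "in", "it"}

# Lowercase every token and drop the stop words
cleaned = [t.lower() for t in tokens if t.lower() not in stop_words]

print(cleaned)
```

Note that lowercasing before the membership check ensures "Is" and "is" are treated the same.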
Stemming
Stemming is the process in which the affixes of words are removed and the words are converted to their base form.
In stemming, words are reduced to their base form by removing prefixes or suffixes. However, the resulting word may not always be meaningful (e.g., studies → studi). Stemming does not check meaning; it simply removes affixes, which makes it a fast process.
| Word | Affixes | Stem |
| going | ing | go |
| goes | es | go |
| studies | es | studi |
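The table above can be reproduced with a naive suffix-stripping stemmer. This is an assumed toy rule (strip the first matching suffix, keep at least two letters), not a real algorithm like Porter's, but it shows why stemming can produce non-words such as "studi".

```python
def simple_stem(word):
    """Strip a common suffix without checking that the result is a real word."""
    for suffix in ("ing", "es", "s"):
        # Only strip if at least two letters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ("going", "goes", "studies"):
    print(w, "->", simple_stem(w))
```

"studies" loses "es" and becomes "studi", which is not a dictionary word, exactly as in the table.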
Lemmatization
- Stemming and lemmatization are alternative processes to each other, as both are used to remove affixes.
- But in Lemmatization, the word we get after affix removal is a meaningful one.
| Word | Affixes | Lemma |
| going | ing | go |
| goes | es | go |
| studies | es | study |
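Unlike stemming, lemmatization consults a vocabulary so the result is always a real word. The dictionary below is a toy lookup table assumed just for this example; real lemmatizers (e.g. WordNet-based ones) use full vocabularies and part-of-speech information.

```python
# Toy lemma dictionary for illustration only
lemmas = {"going": "go", "goes": "go", "studies": "study"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known
    return lemmas.get(word, word)

for w in ("going", "goes", "studies"):
    print(w, "->", lemmatize(w))
```

Note that "studies" maps to "study", a meaningful word, where the stemmer produced "studi".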
Bag of Words
- Bag of Words is a Natural Language Processing model which helps in extracting features out of the text which can be helpful in machine learning algorithms.
- In the bag of words, we get the occurrences of each word and construct the vocabulary for the corpus.
- Bag of words gives us two things:
- Vocabulary of words for corpus
- Frequency of these words
Steps to implement Bag of Words
- Text Processing
- Create a Dictionary
- Create a Document Vector
- Create Document Vectors for all the documents
Step 1: Collect Data & Pre-process
We start with documents and apply text normalisation.
Original Documents:
- Document 1: Aman and Avni are stressed
- Document 2: Aman went to a therapist
- Document 3: Avni went to download a health chatbot
After Text Normalisation (Tokens):
- Document 1: [aman, and, avni, are, stressed]
- Document 2: [aman, went, to, a, therapist]
- Document 3: [avni, went, to, download, a, health, chatbot]
Step 2: Create Dictionary (Vocabulary)
List all unique words from all documents.
Dictionary:
- aman, and, avni, are, stressed
- went, to, a, therapist
- download, health, chatbot
(Each word is written only once, even if repeated in documents.)
Step 3: Create Document Vector (Example: Document 1)
For each word in the dictionary:
- Put 1 if the word is present
- Put 0 if the word is absent
| Word | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Step 4: Create Vectors for All Documents
| Document | aman | and | avni | are | stressed | went | to | a | therapist | download | health | chatbot |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Final Output
- The table above is called the Document Vector Table
- Rows = Documents
- Columns = Words (Vocabulary)
- Values = Frequency (0 or 1 here)
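The whole Bag of Words procedure (Steps 1-4) can be sketched directly from the example documents. The vocabulary is built in first-seen order so that it matches the dictionary above, and each document vector counts how often each vocabulary word occurs.

```python
# Tokens from the three normalised documents
docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: build the vocabulary, each unique word once, in first-seen order
vocab = []
for doc in docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

# Steps 3-4: one vector per document, counting each vocabulary word
vectors = [[doc.count(word) for word in vocab] for doc in docs]

print(vocab)
for v in vectors:
    print(v)
```

The first printed vector is [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], matching the Document 1 row of the Document Vector Table.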
TFIDF
- It refers to Term Frequency and Inverse Document Frequency
- It helps us identify the value of each word.
- The purpose of TFIDF is to find those rare and valuable words which occur the least but add the most value to the corpus.
Term Frequency
- Term frequency is the frequency of a word in one document.
- Term frequency can easily be found in the document vector table.
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Inverse Document Frequency
- Document Frequency is the number of documents in which the word occurs irrespective of how many times it has occurred in those documents.
- For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator.
Document Frequency
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 2 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |
Inverse Document Frequency
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 3/2 | 3/1 | 3/2 | 3/1 | 3/1 | 3/2 | 3/2 | 3/2 | 3/1 | 3/1 | 3/1 | 3/1 |
Final TFIDF Formula
TFIDF(W) = TF(W) * log(IDF(W)), where IDF(W) = (total number of documents) / (document frequency of W)
Example: TFIDF(aman) in Document 1 = 1 * log(3/2)
| Aman | And | Avni | Are | Stressed | Went | To | A | Therapist | Download | Health | Chatbot |
| 0.176 | 0.477 | 0.176 | 0.477 | 0.477 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.176 | 0 | 0 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0.477 | 0 | 0 | 0 |
| 0 | 0 | 0.176 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0 | 0.477 | 0.477 | 0.477 |
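The TFIDF values in the table can be verified with a few lines of Python, using base-10 logarithms (which is what produces 0.176 for log(3/2) and 0.477 for log(3/1)).

```python
import math

docs = [
    ["aman", "and", "avni", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["avni", "went", "to", "download", "a", "health", "chatbot"],
]
n_docs = len(docs)

def tfidf(word, doc):
    tf = doc.count(word)                      # term frequency in this document
    df = sum(1 for d in docs if word in d)    # number of documents containing the word
    return tf * math.log10(n_docs / df)       # TF * log(N / DF)

print(round(tfidf("aman", docs[0]), 3))   # appears in 2 of 3 documents
print(round(tfidf("and", docs[0]), 3))    # appears in only 1 document
```

This prints 0.176 and 0.477, matching the Document 1 row: "and" scores higher because it occurs in only one document.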
TFIDF – Key Points
- Words that appear frequently in all documents get low TF-IDF values and are usually treated as stop words.
- A word gets a high TF-IDF value when it appears often in one document but rarely in others, showing it is important for that document.
- Higher TF-IDF values indicate more important words for understanding and processing the text.
Applications of TFIDF
- Document Classification: Helps in identifying the type or category of a document.
- Topic Modelling: Helps in predicting the main topic of a group of documents (corpus).
- Information Retrieval System: Helps in extracting important and relevant information from large text data.
- Stop Word Filtering: Helps in removing unnecessary and less meaningful words from text.