Comprehensive-Text-Extraction-and-Analysis-for-Article-Metrics

Project Overview

This project extracts article text from a list of URLs and computes text metrics from it: sentiment scores, the percentage of complex words, and average word length. The extracted articles are saved as text files, and the computed metrics are written to a final CSV file.

Approach

Data Extraction

Read Input Data:
    Load the URLs from an Excel file located at /path/to_your/file/data/Input.xlsx.

Scrape Articles:
    For each URL, use the requests library to fetch the web page content and BeautifulSoup to parse the HTML.
    Extract the article text from specific HTML elements.
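
A minimal sketch of the fetch-and-parse step. This README does not name the HTML elements that hold the article, so the `h1` and `article` lookups below are placeholders, and `parse_article` and `fetch_article` are hypothetical names rather than functions taken from training.py.

```python
import requests
from bs4 import BeautifulSoup


def parse_article(html):
    """Return (title, body_text) from raw HTML, or None if an expected element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")        # assumed title element
    body = soup.find("article")    # assumed article container
    if title is None or body is None:
        return None
    # One line per paragraph keeps the saved text readable.
    paragraphs = [p.get_text().strip() for p in body.find_all("p")]
    return title.get_text().strip(), "\n".join(paragraphs)


def fetch_article(url, timeout=10):
    """Download a page and hand its HTML to parse_article."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()    # surface HTTP errors instead of parsing an error page
    return parse_article(response.text)
```

Keeping the parsing separate from the download makes it possible to check the selectors against a saved HTML file without touching the network.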

Handle Blank Links

Identify Missing Data:
    Check if the expected HTML elements are missing. If so, log these URLs for further processing.

Attempt Alternative Extraction:
    Reattempt data extraction for URLs with missing data using different HTML elements or methods.
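
One way to sketch the retry logic is to try a list of candidate selectors in order and record the URLs for which none of them matched. The selectors in `FALLBACK_SELECTORS` are illustrative guesses, and both function names are hypothetical; the actual elements depend on the sites being scraped.

```python
from bs4 import BeautifulSoup

# Candidate CSS selectors, tried in order. All of them are illustrative guesses.
FALLBACK_SELECTORS = ["article", "div.post-content", "main"]


def extract_with_fallback(html, selectors=FALLBACK_SELECTORS):
    """Return text from the first selector that yields non-empty content, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            text = node.get_text(" ", strip=True)
            if text:
                return text
    return None


def extract_all(pages):
    """Split {url: html} into extracted texts and a list of URLs that still failed."""
    texts, missing = {}, []
    for url, html in pages.items():
        text = extract_with_fallback(html)
        if text is None:
            missing.append(url)    # logged for further processing
        else:
            texts[url] = text
    return texts, missing
```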

Text Preprocessing

Tokenization:
    Tokenize the text into words using NLTK's word_tokenize.

Stopwords Removal:
    Remove common English stopwords using NLTK's stopword list.

Lemmatization:
    Lemmatize tokens to their base form using NLTK's WordNetLemmatizer.
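
The three preprocessing steps can be sketched as a single function. Lowercasing and dropping non-alphabetic tokens are assumptions added here, not steps stated in this README, and the function names are hypothetical. `nltk_preprocess` wires in the NLTK pieces named above and needs the punkt, stopwords, and wordnet data to be downloaded first.

```python
def preprocess(text, tokenize, stop_words, lemmatize):
    """Tokenize, drop stopwords and non-alphabetic tokens, then lemmatize."""
    # Assumption: tokens are lowercased so they match a lowercase stopword list.
    tokens = [token.lower() for token in tokenize(text)]
    # Assumption: punctuation and numbers are discarded along with stopwords.
    kept = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatize(t) for t in kept]


def nltk_preprocess(text):
    """Run preprocess() with NLTK's tokenizer, stopword list, and lemmatizer."""
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    return preprocess(
        text,
        tokenize=word_tokenize,
        stop_words=set(stopwords.words("english")),
        lemmatize=lemmatizer.lemmatize,
    )
```

Passing the tokenizer, stopword set, and lemmatizer as arguments lets the pipeline be exercised with simple stand-ins when the NLTK data is not available.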

Text Analysis

Sentiment Analysis:
    Use predefined positive and negative word lists to calculate sentiment scores.

Word Complexity:
    Calculate the percentage of complex words (words with more than two syllables).

Average Word Length:
    Compute the average length of words in the text.
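
The three metrics above can be sketched as follows, operating on an already preprocessed list of words. This README does not give the exact formulas, so the polarity expression and the vowel-group syllable heuristic are assumptions, and all function names are hypothetical.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def sentiment_scores(words, positive, negative):
    """Count hits against each word list and derive a polarity between -1 and 1."""
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    # Assumed formula: the small constant only guards against division by zero.
    polarity = (pos - neg) / (pos + neg + 1e-6)
    return pos, neg, polarity


def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels, with a minimum of one."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def complex_word_percentage(words):
    """Percentage of words with more than two syllables."""
    if not words:
        return 0.0
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    return 100.0 * complex_count / len(words)


def average_word_length(words):
    """Mean number of characters per word."""
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)
```

The vowel-group heuristic miscounts some words (silent final "e", for example), so a dictionary-based syllable counter would be more accurate if precision matters.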

Saving Results

Text Files:
    Save the extracted articles as text files in the path/to_your/file/text_files directory.

CSV File:
    Save the final analysis results in a CSV file located at path/to_your/file/final.csv.
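
A minimal sketch of the two output steps. It uses Python's standard `csv` module rather than pandas to stay dependency-free, which may differ from what training.py does. The function names and the column names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def save_article(text_dir, article_id, text):
    """Write one article to <text_dir>/<article_id>.txt and return the path."""
    text_dir = Path(text_dir)
    text_dir.mkdir(parents=True, exist_ok=True)
    path = text_dir / f"{article_id}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def save_metrics(csv_path, rows):
    """Write a non-empty list of metric dicts to CSV, using the first row's keys as the header."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

For example, `save_metrics("final.csv", [{"url_id": "article_1", "avg_word_length": 4.5}])` writes a header row followed by one data row.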

Prerequisites

Ensure you have Python 3.x installed. This project uses several Python libraries, listed in requirements.txt.

Setup

Clone the Repository:

```
git clone <repository_url>
cd <repository_directory>
```

Create and Activate Virtual Environment (Optional but recommended):

```
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
```

Install Dependencies:
```
pip install -r requirements.txt
```

Download NLTK Data: Open a Python shell and run:

```
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
```

Running the Project

Ensure the Excel file Input.xlsx is placed in the data directory.
Run the Main Script:
```
python3 training.py
```
Outputs:
- Extracted text files will be saved in the text_files directory.
- The final merged CSV file will be saved in the data directory as final.csv.

Dependencies

beautifulsoup4==4.9.3
nltk==3.5
numpy==1.19.5
pandas==1.2.1
requests==2.25.1
urllib3==1.26.5

These dependencies are also listed in the requirements.txt file.

Notes

Make sure you have a stable internet connection while running the script, as it involves fetching data from the provided URLs.
If you encounter any issues, check the error messages and ensure that all dependencies are installed correctly.
