Sentixtract

Sentixtract is a Python project for scraping articles from URLs and applying Natural Language Processing (NLP) techniques to them, with a primary focus on sentiment analysis. It extracts the text of web articles, cleans and processes it, and computes sentiment scores and other linguistic metrics from the result.

Features

Article Extraction: Automatically fetches articles from specified URLs and saves them locally.
Text Preprocessing: Cleans and tokenizes text data, removing non-alphabetic characters and stop words.
Sentiment Analysis: Calculates sentiment scores based on the presence of positive and negative words.
Readability Metrics: Computes readability scores such as average sentence length and Fog Index.
Additional Linguistic Metrics: Calculates syllable counts, personal pronouns, and other linguistic features.
Progress Tracking: Utilizes tqdm to display progress bars for tasks like data extraction and analysis.
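The preprocessing, sentiment, and readability features above can be illustrated with a minimal sketch. It is not the project's actual code: the tiny word lists, the function names, the polarity formula, and the vowel-group syllable heuristic are all assumptions made for illustration, and the real tool presumably loads its word lists from the MasterDictionary and StopWords directories. The Fog Index follows the standard Gunning definition.

```python
import re

# Hypothetical miniature word lists, used only for this sketch.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
STOP_WORDS = {"the", "is", "a", "an", "and", "but", "was", "were"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def sentiment_scores(tokens):
    """Count positive and negative words and derive a polarity score.

    The polarity formula is an assumption, not taken from the project.
    """
    positive = sum(1 for t in tokens if t in POSITIVE)
    negative = sum(1 for t in tokens if t in NEGATIVE)
    # The small constant avoids division by zero when no sentiment words occur.
    polarity = (positive - negative) / (positive + negative + 1e-6)
    return positive, negative, polarity


def count_syllables(word):
    """Approximate syllables as the number of vowel groups (at least one)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fog_index(text):
    """Gunning Fog Index: 0.4 * (average sentence length + percent complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # A word counts as complex here when the heuristic finds 3+ syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * complex_words / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)
```

Calling `sentiment_scores(tokenize(text))` on an article's text yields the positive count, negative count, and polarity, while `fog_index(text)` works on the raw text because it needs sentence boundaries that tokenization discards.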

Dependencies

Python (3.11.3 recommended)
pandas (2.0.3)
requests (2.31.0)
tqdm (4.65.0)
beautifulsoup4 (4.12.2)
nltk (3.8.1)

Installation

Clone the repository:

git clone https://github.com/yourusername/Sentixtract.git
cd Sentixtract

Install dependencies:
```
pip install -r requirements.txt
```

Usage

1. Prepare your input data in an Excel file (`input_data.xlsx`) containing the columns `URL` and `URL_ID`.

2. Run the main script to extract articles and analyze them:

python sentixtract.py input_data.xlsx --article_dir articles --master_dict_dir MasterDictionary --output_file Output.xlsx

Replace `input_data.xlsx` with your input file path and adjust directory paths as necessary.

3. Review the generated output in `Output.xlsx` for sentiment scores, readability metrics, and other analyzed features.
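For readers who want to see how the command-line options above fit together, here is a rough sketch of an `argparse` parser that would accept the same invocation. The function name `build_parser`, the help strings, and the default values are assumptions for illustration only; the actual script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser matching the command shown in the Usage steps."""
    parser = argparse.ArgumentParser(
        description="Extract articles from URLs and compute text metrics."
    )
    # Positional argument: the Excel file with URL and URL_ID columns.
    parser.add_argument("input_file", help="Excel file with URL and URL_ID columns")
    # The defaults below are assumed, mirroring the values in the example command.
    parser.add_argument("--article_dir", default="articles",
                        help="Directory where fetched articles are saved")
    parser.add_argument("--master_dict_dir", default="MasterDictionary",
                        help="Directory holding positive and negative word lists")
    parser.add_argument("--output_file", default="Output.xlsx",
                        help="Excel file that receives the computed metrics")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["input_data.xlsx", "--output_file", "Output.xlsx"])
print(args.input_file, args.article_dir, args.output_file)
```

Under these assumed defaults, omitting `--article_dir` or `--master_dict_dir` falls back to the directory names used in the example command.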

PranjalDhamane/Sentixtract