Data Extraction and NLP Analysis Assignment
Overview
This project extracts article text from a given list of URLs and analyzes it in Python to compute linguistic and readability metrics. Both steps must be accurate: the extraction should capture only the article itself, and the analysis should follow the supplied definitions.
Objective
The primary goal of this assignment is to extract the text from provided articles and analyze the content to compute several text-related variables. These variables provide insights into the sentiment, readability, and linguistic complexity of the articles.
Data Extraction
Input
File: Input.xlsx
Task: Extract the title and main content of each article from the provided URLs.
Output: Save each extracted article in a text file named after the URL_ID.
Method
Tools: Python, BeautifulSoup, Selenium, Scrapy (or any preferred library for web scraping)
Instructions:
Extract only the article title and text.
Avoid extracting headers, footers, advertisements, or any non-article content.
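The extraction rules above can be sketched with BeautifulSoup as follows. This is a minimal sketch, not a definitive implementation. It assumes the title sits in the first <h1> and the body in <p> elements inside an <article> tag, which will not hold for every site. The helper names extract_article and save_article, the NON_ARTICLE_TAGS list, and the "articles" output folder are hypothetical.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Assumption: paragraphs nested in these tags are not article content.
NON_ARTICLE_TAGS = ["header", "footer", "nav", "aside"]


def extract_article(html: str) -> tuple[str, str]:
    """Return (title, body) for one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the first <h1> holds the article title.
    h1 = soup.find("h1")
    title = h1.get_text(" ", strip=True) if h1 is not None else ""

    # Assumption: the body lives in <article>; otherwise scan the whole page.
    container = soup.find("article")
    if container is None:
        container = soup

    paragraphs = []
    for p in container.find_all("p"):
        # Skip paragraphs that sit inside a header, footer, nav, or aside.
        if p.find_parent(NON_ARTICLE_TAGS) is not None:
            continue
        text = p.get_text(" ", strip=True)
        if text:
            paragraphs.append(text)
    return title, "\n".join(paragraphs)


def save_article(url_id: str, title: str, body: str, out_dir: str = "articles") -> Path:
    """Write the title and body to <out_dir>/<url_id>.txt and return the path."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{url_id}.txt"
    target.write_text(f"{title}\n{body}", encoding="utf-8")
    return target
```

Pages that render their content with JavaScript would need Selenium or a similar tool to obtain the HTML first; the parsing step stays the same.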
Data Analysis
Variables to Compute
The analysis involves computing the following variables for each extracted article:
Positive Score
Negative Score
Polarity Score
Subjectivity Score
Average Sentence Length
Percentage of Complex Words
Fog Index
Average Number of Words per Sentence
Complex Word Count
Word Count
Syllable per Word
Personal Pronouns
Average Word Length
Method
Tools: Python, NLTK, regex
Instructions:
Perform textual analysis as per the definitions provided in Text Analysis.docx.
Save the analysis results in the specified format given in Output Data Structure.xlsx.
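As an illustration of the readability variables, here is a rough regex-only sketch. The authoritative definitions are in Text Analysis.docx and are not reproduced here, so every rule below is an assumption: a complex word is one with more than two syllables, syllables are approximated by counting vowel groups, the Fog Index uses the standard Gunning form 0.4 * (average sentence length + percentage of complex words), and personal pronouns are limited to I, we, my, ours, and us (excluding the country abbreviation "US"). All function names are hypothetical.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
# Assumption: this pronoun list; the assignment's own list may differ.
PRONOUN = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows ., ! or ? (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups, discounting a final 'es'/'ed'."""
    word = word.lower()
    count = len(VOWEL_GROUP.findall(word))
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)


def readability_metrics(text: str) -> dict[str, float]:
    sentences = split_sentences(text)
    words = WORD.findall(text)
    if not sentences or not words:
        raise ValueError("text contains no sentences or words")

    syllables = [count_syllables(w) for w in words]
    complex_count = sum(1 for s in syllables if s > 2)
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * complex_count / len(words)
    # Exclude the upper-case abbreviation "US" from the pronoun count.
    pronouns = [m for m in PRONOUN.findall(text) if m != "US"]

    return {
        "word_count": len(words),
        "avg_sentence_length": avg_sentence_length,
        "complex_word_count": complex_count,
        "pct_complex_words": pct_complex,
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
        "syllables_per_word": sum(syllables) / len(words),
        "personal_pronouns": len(pronouns),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }
```

This sketch counts every token as a word. If Text Analysis.docx calls for stop-word removal, a 0-1 scale for the percentage, or a different syllable rule, the corresponding lines would need to change. Under these assumptions Average Sentence Length and Average Number of Words per Sentence reduce to the same value.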
Output Structure
File: Output Data Structure.xlsx
Content:
The output file should contain all input variables from Input.xlsx.
Include all computed variables in the specified order.
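One way to keep the column order fixed is to declare it once and build every row from it. The sketch below uses the standard csv module. The input column names (URL_ID, URL) and the exact header spellings are assumptions; the real names and order must be taken from Output Data Structure.xlsx. The helpers build_row and write_output are hypothetical.

```python
import csv
from typing import Iterable, Mapping, TextIO

# Assumption: Input.xlsx has these two columns.
INPUT_COLUMNS = ["URL_ID", "URL"]

# Computed variables in the order listed in this document.
COMPUTED_COLUMNS = [
    "Positive Score",
    "Negative Score",
    "Polarity Score",
    "Subjectivity Score",
    "Average Sentence Length",
    "Percentage of Complex Words",
    "Fog Index",
    "Average Number of Words per Sentence",
    "Complex Word Count",
    "Word Count",
    "Syllable per Word",
    "Personal Pronouns",
    "Average Word Length",
]


def build_row(input_row: Mapping[str, object], metrics: Mapping[str, object]) -> dict:
    """Combine the input columns with the computed variables, in order."""
    missing = [c for c in COMPUTED_COLUMNS if c not in metrics]
    if missing:
        raise KeyError(f"missing computed variables: {missing}")
    row = {c: input_row[c] for c in INPUT_COLUMNS}
    row.update({c: metrics[c] for c in COMPUTED_COLUMNS})
    return row


def write_output(rows: Iterable[Mapping[str, object]], stream: TextIO) -> None:
    """Write a header plus one line per article to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=INPUT_COLUMNS + COMPUTED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, for example open("output.csv", "w", newline="", encoding="utf-8").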
Submission Guidelines
Timeline: Complete the project within 6 days.
Submission:
Fill out the Google form to submit your solution.
Upload your solution to Google Drive and share the drive URL in the form.
Include the following files in your submission:
A .py file containing the Python code.
An output file in CSV or Excel format as per the given structure.
A README file that explains how to run the .py file and generate the output, and lists all dependencies.
Approach and Instructions
Extract URLs from Input.xlsx:
Read the URLs from the provided Excel file.
Scrape Article Data:
Use BeautifulSoup to extract the article title and main content.
Save the extracted content in text files named after the URL_ID.
Perform Text Analysis:
Tokenize the text to compute word counts, sentence lengths, and other metrics.
Calculate sentiment scores using predefined positive and negative word lists.
Compute readability metrics like the Fog Index.
Save Analysis Results:
Store the computed variables in the specified output format.
Ensure Accurate and Clean Data:
Validate the data extraction process to avoid non-article content.
Cross-check the computed variables for consistency.
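The sentiment step and the final cross-check can be sketched together. The formulas here are assumptions, since the official definitions are in Text Analysis.docx: polarity is taken as (positive - negative) / (positive + negative + 0.000001) and subjectivity as (positive + negative) / (total words + 0.000001). The word lists are passed in as sets; the function names and the tiny lists in the usage note are hypothetical.

```python
import re

EPSILON = 1e-6  # avoids division by zero when no sentiment words are found


def sentiment_scores(text: str, positive_words: set[str], negative_words: set[str]) -> dict[str, float]:
    """Count dictionary hits and derive polarity and subjectivity."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    positive = sum(1 for t in tokens if t in positive_words)
    negative = sum(1 for t in tokens if t in negative_words)
    return {
        "positive_score": positive,
        "negative_score": negative,
        # Assumed formulas; confirm against Text Analysis.docx.
        "polarity_score": (positive - negative) / (positive + negative + EPSILON),
        "subjectivity_score": (positive + negative) / (len(tokens) + EPSILON),
    }


def check_consistency(scores: dict[str, float]) -> list[str]:
    """Return a list of problems; an empty list means the values are plausible."""
    problems = []
    if scores["positive_score"] < 0 or scores["negative_score"] < 0:
        problems.append("positive and negative scores must be non-negative")
    if not -1.0 <= scores["polarity_score"] <= 1.0:
        problems.append("polarity score must lie between -1 and 1")
    if not 0.0 <= scores["subjectivity_score"] <= 1.0:
        problems.append("subjectivity score must lie between 0 and 1")
    return problems
```

For example, sentiment_scores("Good results but a bad delay", {"good"}, {"bad"}) counts one positive and one negative word, so the polarity under these formulas is 0. Running check_consistency on every row before saving is a cheap way to catch tokenization or word-list mistakes.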