Um Sistema de Aquisição e Análise de Dados para Extração de Conhecimento da Plataforma Ebit (A Data Acquisition and Analysis System for Knowledge Extraction from the Ebit Platform)
The development of the internet and the resulting changes in how people communicate have strengthened online social networks, increasing people's involvement with this medium and making consumers of products and services better informed and more demanding of companies. This context has given rise to Social CRM, which can be put into practice through electronic word-of-mouth platforms that let users share comments and evaluations about companies on the web, thereby defining those companies' reputations. However, most electronic word-of-mouth platforms do not provide mechanisms for extracting their information, which makes the data difficult to analyze. To fill this gap, a system was developed to automatically capture and summarize the data of the companies registered on the eBit platform.
💡 Research Summary
The paper addresses the growing need for companies to harness consumer opinions posted on electronic word‑of‑mouth (e‑WOM) platforms, focusing on Brazil’s leading platform, eBit. While eBit publicly displays aggregate scores, it does not provide a programmatic way to retrieve the underlying textual reviews, making systematic analysis difficult. To fill this gap, the authors designed and implemented an end‑to‑end system that automatically captures, cleans, analyzes, and summarizes the reviews of companies listed on eBit.
The system architecture is divided into four layers: collection, storage, analysis, and service. The collection layer uses a hybrid approach that combines Selenium-driven headless browsing with direct HTTP requests, so that it can handle both static HTML and dynamic AJAX content. After extracting the hidden company identifier and the pagination token, the crawler iterates through all review pages while respecting eBit's rate-limit policies. The extracted data (review text, rating, date, and user ID) is stored as structured JSON.
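The token-driven pagination loop described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the `fetch_page` callable, the page shape (`reviews`, `next_token`), and the one-second default interval are all assumptions made for the example, and the actual HTTP or Selenium call is left to the caller.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical page shape: {"reviews": [...], "next_token": "..." or None}
Page = Dict[str, Any]


def crawl_reviews(
    fetch_page: Callable[[str, Optional[str]], Page],
    company_id: str,
    min_interval: float = 1.0,   # assumed politeness delay, in seconds
    max_pages: int = 1000,       # safety bound against endless pagination
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[dict]:
    """Yield every review of one company, following pagination tokens."""
    token: Optional[str] = None
    last_request: Optional[float] = None
    for _ in range(max_pages):
        # Wait until at least min_interval has passed since the last request.
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        page = fetch_page(company_id, token)
        for review in page.get("reviews", []):
            yield review
        # A missing or empty token marks the last page.
        token = page.get("next_token")
        if not token:
            break
```

Passing `fetch_page`, `sleep`, and `clock` as parameters keeps the loop independent of any particular HTTP client and makes the rate limiting easy to check without touching the network.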
In the storage layer, a MySQL cluster holds the normalized schema (companies, reviews, users, sentiment labels), while Redis provides a fast cache for frequent queries. The analysis layer consists of three main modules. First, a preprocessing pipeline removes HTML tags, emojis, and URLs, then performs Portuguese tokenization and part-of-speech tagging using spaCy-pt. Duplicate and spam reviews are filtered out with TF-IDF cosine similarity and behavioral heuristics (posting frequency, IP patterns). Second, sentiment analysis employs a hybrid of a lexicon-based VADER adaptation and a fine-tuned BERT model, reaching 92% accuracy on three-class (positive, neutral, negative) labeling when compared with human annotations. Third, topic modeling with Latent Dirichlet Allocation extracts 5 to 7 dominant themes (e.g., delivery, customer service, product quality) and visualizes them as word clouds. For summarization, the system runs both an extractive method (TextRank) and a generative transformer (GPT-2-style) to produce concise overviews of 3 to 5 sentences for each company.
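To make the duplicate-filtering step concrete, here is a small sketch of near-duplicate detection with TF-IDF vectors and cosine similarity, written in plain Python. The regex tokenizer, the smoothed IDF formula, and the 0.9 threshold are assumptions chosen for illustration; the summary does not state which settings the authors used, and the behavioral heuristics (posting frequency, IP patterns) are not covered here.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters (accented letters included)."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(docs: List[str], threshold: float = 0.9) -> List[int]:
    """Return indices of reviews kept after removing near-duplicates.

    A review is dropped when its similarity to an already kept review
    reaches the (assumed) threshold.
    """
    vectors = tfidf_vectors(docs)
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

The pairwise comparison is quadratic in the number of reviews, which is acceptable for a per-company batch but would need an index or blocking strategy at larger scale.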
The service layer exposes a RESTful API built with Flask and a web dashboard created with React and Vue components. Users can explore rating trends, sentiment distributions, topic visualizations, and automatically generated summaries. Export functions generate CSV and PDF reports for executive consumption.
The authors evaluated the platform on ten representative companies, covering a total of 45 000 reviews. The crawler achieved a 98 % success rate, average processing time per review was 0.12 seconds, and the summarization module received an average human rating of 4.3 out of 5. The system therefore provides near‑real‑time, actionable insights that were previously inaccessible through eBit’s native interface.
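As a rough sense of scale, the reported figures imply the following total processing time, assuming the reviews are processed one after another and that the 0.12-second average applies uniformly (neither assumption is stated in the summary):

```latex
\[
  45\,000 \ \text{reviews} \times 0.12 \ \frac{\text{s}}{\text{review}}
  = 5\,400 \ \text{s}
  = 90 \ \text{min}
\]
```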
Key contributions include (1) a robust crawling framework for dynamic e‑WOM sites, (2) a Portuguese‑specific NLP pipeline integrating sentiment classification and topic extraction, and (3) an integrated reporting tool that translates raw consumer feedback into strategic knowledge. Limitations involve incomplete handling of emerging slang and emojis, and limited scalability for continuous streaming of new reviews. Future work will focus on distributed crawling, incorporation of larger language models such as LLaMA for improved linguistic coverage, and extending the platform to monitor multiple social media channels for real‑time reputation management.