Julio S. Solís Arce presents "Ethnographic Perspectives: A Large-Scale Cross-Cultural Dataset".
Abstract:
The study of culture in social science research increasingly relies on large-scale data, yet existing resources remain constrained by limited source diversity and a lack of detailed information on human societies. In this project, we are building and introducing a novel cross-cultural dataset that incorporates a broad set of cultural, economic, and social attributes derived from diverse ethnographic sources, including accounts from the eHRAF World Cultures database and the HathiTrust Digital Library, as well as anthropological journal publications available on JSTOR. We automate the variable coding process by leveraging recent advances in retrieval-augmented generation (RAG) systems. We then develop a pipeline to validate both our retrieval and generative processes using human evaluations and historical corrections to existing ethnographic datasets. In addition to providing the research community with new data to illuminate pre-industrial societal characteristics, we establish a systematic framework for variable coding that researchers can adapt for broader cultural analysis.