ANNOUNCEMENT

Reclaiming AI for Africa: A Reflection on Kathleen Siminyu’s Talk

Dr Ricardo O'Nascimento's reflection on the guest lecture “AI in Africa” at Interface Cultures.

During a recent guest lecture at Interface Cultures, I had the opportunity to hear Kathleen Siminyu, a researcher at the Distributed AI Research Institute (DAIR) and a renowned figure in the African AI landscape. Her talk was informative and inspiring, drawing connections between grassroots organising, AI research, and linguistic justice. What follows is an account of her presentation, the key topics she addressed, and the critical mission that characterises her work.

The talk started with Manuela Naveau from Interface Cultures acknowledging the diversity and interdisciplinarity of the group present at the talk. She warmly welcomed the students and researchers in attendance, who came from Austria, Germany, the UK, India, South Korea, Italy, Bosnia, Serbia, Slovenia, and Spain. She also explained that Kathleen had come to Linz to take part in the S+T+Arts Prize Africa deliberations, and framed this talk as part of a larger dialogue on inclusive, interdisciplinary knowledge-making, central to the ethos of Interface Cultures.

Kathleen positions herself at the intersection of natural language processing (NLP) research and activism. Her research focuses on developing AI tools for African languages — most of which are categorised as “low-resource” by mainstream AI due to the lack of digital data and processing tools. She explained how this technological underrepresentation is symptomatic of broader epistemic injustices. For her, data creation is not merely technical work but an act of resistance — a way to claim the presence and value of African knowledge systems in the digital age.

“The only reason we have ended up building datasets is because they don’t exist. So we have to build them.”

She shared her trajectory, which began with founding the Nairobi Women in Machine Learning and Data Science group. This grassroots initiative emerged in a context where few university-level AI courses were available. The group acts as a learning collective and a springboard for connecting to other key networks such as Data Science Africa, Deep Learning Indaba, and Masakhane.


Deep Learning Indaba: Nurturing a Continental Movement
Kathleen talked about her experience with Deep Learning Indaba. The first edition took place in 2017 in South Africa, gathering researchers from across the continent for the first time. Kathleen attended and quickly became a co-organiser. She helped lead the 2018 and 2019 events, including a particularly memorable one in Nairobi, which welcomed over 700 participants. She shared a photo from that event and described it as a proud moment in her professional journey — a celebration of African excellence in AI.
She emphasised how Deep Learning Indaba evolved from a single-track program (e.g., “Introduction to Python”) to a diverse event, reflecting the growing expertise within the community. For instance, NLP workshops began to cater to participants who were building tools in their native languages.

From Baseline Models to Movement Building: The Power of Masakhane
Masakhane, an organisation Kathleen co-directs, emerged from the Deep Learning Indaba community. Its ethos is participatory, Pan-African, and open-access. The organisation began by focusing on machine translation. A pivotal moment occurred during an NLP workshop at Indaba 2019, where a seminal paper was presented that enabled anyone to build a baseline machine translation model using the JW300 dataset.
JW300, derived from Jehovah's Witnesses' multilingual religious publications, was a treasure trove for African language representation. By simply changing the ISO language code in the training script, researchers could train a model in their own language, dramatically lowering the barrier to entry. Over two years, Masakhane contributors produced translation models for 35 African languages, culminating in a landmark paper co-authored by over 40 people. The paper broke with convention by crediting not only technical contributors but also the native speakers who supported annotation and evaluation. This inclusive authorship model became a cornerstone of Masakhane's values.
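To make the “change one ISO code” point concrete, here is a minimal sketch of my own; the names, paths, and configuration are illustrative assumptions, not Masakhane's actual training script. The idea is simply that the whole baseline hangs off a single target-language code, so retargeting it to another African language is a one-line change.

```python
# Illustrative sketch only: names and paths are hypothetical, not Masakhane's
# actual pipeline. The point is that the baseline is parameterised by one
# ISO language code, so retargeting it is a one-line change.
from dataclasses import dataclass


@dataclass
class BaselineConfig:
    source_iso: str            # e.g. "en"
    target_iso: str            # e.g. "sw" (Kiswahili), "yo" (Yoruba), "xh" (isiXhosa)
    corpus: str = "JW300"

    @property
    def pair(self) -> str:
        return f"{self.source_iso}-{self.target_iso}"

    def paths(self) -> dict:
        # Parallel corpora on OPUS are organised by language pair, so the same
        # script points at a different dataset when target_iso changes.
        return {
            "train_src": f"data/{self.corpus}.{self.pair}.{self.source_iso}",
            "train_tgt": f"data/{self.corpus}.{self.pair}.{self.target_iso}",
            "model_dir": f"models/{self.corpus.lower()}_{self.pair}",
        }


# Switching the baseline from English->Kiswahili to English->Yoruba:
for tgt in ("sw", "yo"):
    cfg = BaselineConfig(source_iso="en", target_iso=tgt)
    print(cfg.pair, "->", cfg.paths()["model_dir"])
```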

Legal Grey Zones: Copyright, IP, and Ethical Challenges
While the success of JW300 was transformative, it also raised legal and ethical questions. Masakhane's legal advisors discovered that the Jehovah's Witnesses' website prohibits data mining. Despite not being the dataset's original curators, the team was advised to request retroactive permission, which was never granted. Ultimately, JW300 was taken offline. This case made evident the precarious legal terrain faced by African AI researchers, especially when data crosses international jurisdictions with differing copyright regimes.

“Africa is not a country. Each of our 54 nations has its own legal jurisdiction. How do we tell a researcher in Burundi one thing, and another in Kenya something else, when working on the same dataset?”

Kathleen also recounted frustrating encounters with national broadcasters and private media houses. For example, a South African broadcaster refused to provide access to multilingual news archives, even for purely academic use, citing commercial concerns. In Kenya, a private media company requested $30,000 for 8,000 sentences of scraped content — despite initial engagement and enthusiasm. Eventually, they negotiated a $5,000 fee just to publish the work. These examples spotlight how suspicion, market logic, and the absence of clear open data frameworks hamper progress in African AI.

Inclusion or Exploitation? The Illusion of Representation
Kathleen offered a critical perspective on the recent surge in claims of inclusivity by large tech companies, which she described as “performative inclusion.” Models like OpenAI's Whisper or Meta's No Language Left Behind tout African language support, but often fail basic benchmarks of quality or transparency. A Masakhane audit of web-crawled datasets like Common Crawl and ParaCrawl revealed major issues: mislabelled data, sentences in the wrong language, and even meaningless content branded as African language text.
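To give a flavour of what such an audit looks for, here is a rough sketch of my own, not the published methodology, which relied on native-speaker review. It uses the off-the-shelf langdetect library (a third-party package, pip install langdetect) to flag sentences whose detected language disagrees with the dataset's label; the weak coverage of African languages in tools like this is itself part of the problem Kathleen described.

```python
# Rough illustration only: the audits Kathleen described relied on native-speaker
# review. Automatic language ID is shown here just to make the failure modes
# tangible; its coverage of African languages is itself limited.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect's results deterministic

# Toy "web-crawled" samples, all labelled as Swahili ("sw"). The second and third
# mimic problems the audit found: wrong language and meaningless content.
labelled_samples = [
    ("sw", "Habari za asubuhi, karibu sana nyumbani kwetu."),
    ("sw", "This sentence is plainly English, not Swahili."),
    ("sw", "@@@@ 1234 http://example.com ####"),
]

for label, text in labelled_samples:
    try:
        guess = detect(text)
    except Exception:  # langdetect raises on junk or feature-less input
        guess = None
    status = "ok  " if guess == label else "FLAG"
    print(f"{status} labelled={label} detected={guess!r} text={text[:45]!r}")
```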

One example involved Meta’s use of the LESAN.AI benchmark for Ethiopian languages without citation. This had real-world consequences: the African startup behind the benchmark lost funding when investors wrongly assumed Meta had solved the problem.

The Dialect Dilemma: Preservation vs. Usability
Kathleen also spoke about the tension between preserving linguistic diversity and building tools people will use. She referenced work with Kiswahili dialects, where native speakers found writing in their own dialects unnatural due to education in standardised Kiswahili. This raises important questions: Should we prioritise revitalisation, even when practical use is limited? Is standardisation a form of erasure or a necessary step toward usability?

“Bias doesn’t start in the data alone — it starts with the interface. When your keyboard can’t even represent your language, what message does that send?”

She also shared a technical but striking example: Ge'ez, a script used to write several Ethiopian and Eritrean languages, has more than 300 characters. Current keyboards and operating systems are ill-equipped to handle such complexity, which can lead to a narrowing of expression or even a change in how people write over time. That is not merely a technical limitation; it is a cultural one.
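Out of curiosity, I checked the scale of that claim against the Unicode tables. The short sketch below is my own illustration, not something from the talk: it counts the assigned code points in the main Ethiopic block, which encodes the Ge'ez script, and the Ethiopic Supplement and Extended blocks add even more on top.

```python
# Counts the assigned code points in Unicode's main Ethiopic block
# (U+1200..U+137F), which encodes the Ge'ez script used for Amharic,
# Tigrinya and other languages. The Ethiopic Supplement and Extended
# blocks add further characters on top of this.
import unicodedata

assigned = []
for codepoint in range(0x1200, 0x1380):
    try:
        unicodedata.name(chr(codepoint))  # raises ValueError if unassigned
        assigned.append(chr(codepoint))
    except ValueError:
        pass

print(f"Assigned characters in the main Ethiopic block: {len(assigned)}")
print("Sample:", " ".join(assigned[:12]))
```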

A Case Study in Data Sovereignty: Lessons from the Māori
Kathleen cited the example of Te Hiku Media, an organisation in New Zealand working on Māori language preservation. Instead of surrendering their data to global tech companies, they chose a long-term approach: training members of their own community in NLP to eventually build their own tools. This strategy ensured that control, knowledge, and value remained with the community.

Despite these efforts, the Māori language has since appeared in OpenAI's Whisper model. Yet the data's provenance remains murky, and the community was not credited. Kathleen used this example to caution against unregulated data scraping, which disproportionately affects already marginalised groups.

Building Ethical Infrastructure
One of the most innovative parts of Kathleen's talk was the introduction of an open licence framework developed through a collaboration between the Data Science Law Lab at the University of Pretoria in South Africa and the Centre for Intellectual Property and Information Technology Law (CIPIT) at Strathmore University in Kenya. The framework aims to share datasets within the community while protecting them from exploitation by large corporations. Inspired by open-source principles but adapted to the data context, the licence embeds terms for ethical use, reciprocity, and attribution.

Toward a Just AI Future
Kathleen closed with a powerful reminder: AI is never neutral. The technologies we build are shaped by the values, assumptions, and power structures behind them. Her work, and that of the Masakhane community, doesn’t aim to “catch up” with global AI — it seeks to reimagine it altogether.

One line has stayed with me since:
“The goal is not merely to be included in someone else’s system — but to build systems of our own.”

This talk was a reminder that another AI is possible — one rooted in care, justice, and the lived realities of the communities it serves.

ricardoonascimento.medium.com/reclaiming-ai-for-africa-a-reflection-on-kathleen-siminyus-talk-at-interface-cultures

Kathleen Siminyu sharing her experience at Interface Cultures