Tools & Skills
- Python (pandas, json, regex, scikit-learn)
- NLP: spaCy, sentence-transformers, OpenAI embeddings, BERT, LSTM, GRU, word2vec, Gensim, etc.
- Familiarity with speech data workflows (Whisper, Coqui, AWS/GCP Speech APIs)
- Basic clustering and similarity search (KMeans, cosine similarity, Faiss)
- SQL/NoSQL (PostgreSQL, Redis) for data storage
- Bonus: experience with annotation tools (Label Studio, Prodigy)
- Data presentation experience using Tableau and Power BI; ETL pipeline experience using Athena, Glue, Spark, and S3
First 3–6 Months
- Help seed Geny's initial voice dataset: transcribe & annotate ~10K utterances from early pilots
- Build scripts to cluster similar commands and link them to intents
- Evaluate Geny's performance across accents (US, Canadian French, Spanish, African, Asian, English)
- Create baseline metrics dashboards (WER, intent accuracy, fallback rate)
- Support senior engineers in retraining/fine-tuning models with your datasets