Zegnat: Thanks to sknebel for pushing me to actually sit down and do this. He and I have gone over the WHATWG innerText algorithm and the W3C CSS Text Module together, and have written two independent implementations for getting plain text from HTML (PHP and Python).
Zegnat: We might want to write more tests (although test coverage is actually pretty good! Over 90%!) and discuss a few specific things (do we want to implement tables? how many line breaks after paragraph elements?).
Zegnat: And then I think we can start moving to getting this into the parsers. Because it treats block-level elements etc. all in one step, it is a lot more generic than the previous algorithm.
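
For illustration only, here is a rough Python sketch of the general idea of getting plain text from HTML while handling block-level elements in a single pass. It is not either of the two implementations mentioned above and does not follow the innerText algorithm step by step. The names `BLOCK_TAGS`, `SKIP_TAGS`, `TextExtractor`, and `html_to_text` are hypothetical, the tag lists are a small assumed subset, tables are ignored, and emitting exactly one line break between blocks is an arbitrary choice for the question that is still open.

```python
import re
from html.parser import HTMLParser

# Assumption: a small illustrative subset of block-level elements.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: elements whose text content should not appear in the output.
SKIP_TAGS = {"script", "style"}
# Internal marker for a block boundary, kept distinct from "\n" produced by <br>.
BLOCK = "\x00"


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")
        elif tag in BLOCK_TAGS:
            self.parts.append(BLOCK)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(BLOCK)

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace, roughly as CSS does for normal text.
            self.parts.append(re.sub(r"[ \t\r\n\f]+", " ", data))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Merge adjacent block boundaries (and spaces around them) into one.
    text = re.sub(r"[ \x00]*\x00[ \x00]*", BLOCK, text)
    # Trim spaces around explicit <br> line breaks.
    text = re.sub(r" *\n *", "\n", text)
    # Assumption: one line break per block boundary.
    return text.replace(BLOCK, "\n").strip()
```

Because every block-level tag goes through the same boundary marker, adding another element is a matter of extending `BLOCK_TAGS` rather than writing a special case, which is the sense in which a one-step treatment is more generic.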